Preface



In the design, implementation, and operational planning of computer and com- 
munication systems, many questions regarding the desired capacity and speed of 
(sub)systems have to be answered. At this point, performance and dependability 
evaluation techniques can be of great help. With these techniques, design deci- 
sions can be prepared using advanced methods to construct appropriate models, 
to parameterise these models, and to solve them. The application of a broad spec- 
trum of such methods and techniques is currently supported by tools (mostly 
software, but sometimes partly hardware as well). Such tools enable system de- 
signers and engineers to construct their models in a flexible and modular way 
using high-level application-oriented modelling languages, to solve their models
with a variety of techniques, and to carry out parametric studies with ease.

The goal of the 11th International Conference on Modelling Tools 
and Techniques for Computer and Communication System Perfor- 
mance Evaluation (“TOOLS 2000”) was to further develop the theory and 
technology for tool-based performance and dependability evaluation of computer 
and communication systems. Important themes included software tools, evalua- 
tion techniques, measurement-based tools and techniques, performance and de- 
pendability evaluation techniques based on formal methods, case studies showing 
the role of evaluation in the design of systems, and application studies in the 
area of centralised and distributed computer systems. 

Previous conferences in this series were held over the past 15 years in Paris 
(1984), Sophia Antipolis (1985), Paris (1987), Palma de Mallorca (1988), Torino 
(1991), Edinburgh (1992), Wien (1994), Heidelberg (1995), Saint Malo (1997), 
and Palma de Mallorca (1998). The proceedings of the last four conferences
also appeared in the series Lecture Notes in Computer Science (Volumes 794, 
977, 1245, and 1469, respectively). 

TOOLS 2000 has been unique in a number of ways. First of all, it was the 
first conference in this series held outside of Europe. More precisely, TOOLS 
2000 was hosted by Motorola at the Galvin Center in Schaumburg (Illinois, 
USA). Secondly, TOOLS 2000 was, for the first time, organised in conjunction 
with the IEEE International Performance and Dependability Symposium (“IPDS
2000”). This allowed for a number of joint components in the programmes: the
two invited speakers, the tool demonstrations, a panel discussion, the tutorial
programme, and the social events were all shared between TOOLS 2000 and
IPDS 2000 participants. Moreover, participants in the event could attend sessions
of either conference.

For TOOLS 2000, the programme committee received 49 regular submissions,
each of which was sent to four PC members for review. Around 95% of all requested
reviews were returned in time for the internet-based programme committee meeting,
which was held in the first week of December 1999. The electronic programme
committee meeting functioned very well, and agreement was reached rapidly on
most of the papers. The WIMPE system (developed by David Nicol), used to manage
the paper submission and reviewing process, turned out to be very convenient
and is recommended for the future.

As a result of the programme committee meeting, 21 high-quality submis- 
sions were selected as regular papers. Thus, the programme featured sessions on 
queueing network models, stochastic Petri nets, simulation techniques, formal 
methods and performance evaluation, measurement techniques and applications, 
and optimisation techniques in mobile networks. Alongside these regular paper 
sessions, the conference included two sessions in which 15 tools (accepted by 
the tool demonstration chair) were briefly presented. A short description of
most of these tools is included in these proceedings as well. The conference was
completed with invited presentations by Mark Crovella (Boston University) and 
Leon Alkalai (Jet Propulsion Laboratory) for TOOLS 2000 and IPDS 2000, re- 
spectively. 

We thank the programme and steering committee members, as well as the
reviewers they assigned, for the wonderful job they did in preparing all the
reviews within a very short time. We also thank the participants of the electronic
programme committee meeting for their contribution in selecting the right
papers. We thank Motorola for hosting this joint conference. Finally, and most
importantly, we thank all the authors of submitted papers. Without their submissions,
this conference would not exist! We hope you found the conference fruitful and
inspiring!
Performance Evaluation
with Heavy Tailed Distributions 
(Extended Abstract) 



Mark E. Crovella 

Department of Computer Science 
Boston University 
111 Cummington St. 
Boston MA USA 02215 
crovella@cs.bu.edu



1 Introduction 

Over the last decade an important new direction has developed in the performance
evaluation of computer systems: the study of heavy-tailed distributions.
Loosely speaking, these are distributions whose tails follow a power law with low
exponent, in contrast to traditional distributions (e.g., Gaussian, Exponential,
Poisson) whose tails decline exponentially (or faster). In the late ’80s and early
’90s experimental evidence began to accumulate that some properties of computer
systems and networks showed distributions with very long tails,
and attention turned to heavy-tailed distributions in particular in the mid ’90s
[?].

To define heavy tails more precisely, let $X$ be a random variable with
cumulative distribution function $F(x) = P[X \le x]$ and its complement
$\bar{F}(x) = 1 - F(x) = P[X > x]$. We say here that a distribution $F(x)$ is
heavy tailed if

$$\bar{F}(x) \sim c\,x^{-\alpha}, \qquad 0 < \alpha < 2 \tag{1}$$

for some positive constant $c$, where $a(x) \sim b(x)$ means
$\lim_{x \to \infty} a(x)/b(x) = 1$.
This definition restricts our attention somewhat narrowly to distributions with
strictly polynomial tails; broader classes such as the subexponential distributions
[?] can be defined and most of the qualitative remarks we make here apply to
such broader classes.

Heavy tailed distributions behave quite differently from the distributions
more commonly used in performance evaluation (e.g., the Exponential). In particular,
when sampling random variables that follow heavy tailed distributions,
the probability of very large observations occurring is non-negligible. In fact,
under our definition, heavy tailed distributions have infinite variance, reflecting
the extremely high variability that they capture; and when $\alpha \le 1$, these
distributions have infinite mean.
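
As a concrete illustration of definition (1), the sketch below uses the Pareto distribution, whose tail is exactly $(k/x)^{\alpha}$, and compares the empirical tail of random samples with that formula. The function names, the seed, and the parameter values ($\alpha = 1.2$, $k = 1$) are illustrative assumptions, not taken from the text.

```python
import random

def pareto_ccdf(x, alpha, k=1.0):
    """P[X > x] for a Pareto distribution with shape alpha and minimum value k."""
    return 1.0 if x <= k else (k / x) ** alpha

def sample_pareto(alpha, k, rng):
    """Inverse-transform sampling: solve (k/x)^alpha = u for x, with u in (0, 1]."""
    u = 1.0 - rng.random()          # rng.random() lies in [0, 1), so u lies in (0, 1]
    return k / u ** (1.0 / alpha)

def empirical_ccdf(samples, x):
    """Fraction of samples strictly larger than x."""
    return sum(1 for s in samples if s > x) / len(samples)

# Assumed example parameters: alpha = 1.2 gives a finite mean but infinite variance.
rng = random.Random(1)
alpha, k = 1.2, 1.0
samples = [sample_pareto(alpha, k, rng) for _ in range(100_000)]

for x in (10.0, 100.0):
    print(x, empirical_ccdf(samples, x), pareto_ccdf(x, alpha, k))
```

Even at $x = 100$, a hundred times the minimum value, the tail probability is still a few parts in a thousand, whereas an Exponential tail at a hundred times its mean would be negligible.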



B.R. Haverkort et al. (Eds.): TOOLS 2000, LNCS 1786, pp. l4^ 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000
2 Evidence

The evidence for heavy-tailed distributions in a number of aspects of computer 
systems is now quite strong. The broadest evidence concerns the sizes of data 
objects stored in and transferred through computer systems; in particular, there 
is evidence for heavy tails in the sizes of: 

— Files stored on Web servers [?];

— Data files transferred through the Internet [?];

— Files stored in general-purpose Unix filesystems [?]; and

— I/O traces of filesystem, disk, and tape activity [21, 36, 37, 38].

This evidence suggests that heavy-tailed distributions of data objects are
widespread, and these heavy-tailed distributions have been implicated as an
underlying cause of self-similarity in network traffic [9, 29, 33, 42].

Next, measurements of job service times or process execution times in
general-purpose computing environments have been found to exhibit heavy tails
[17, 23, 27].

A third area in which heavy tails have recently been noted is in the distribution
of node degree of certain graph structures. Faloutsos et al. [14] show that
the inter-domain structure of the Internet, considered as a directed graph, shows
a heavy-tailed distribution in the outdegree of nodes. Another study shows that
the same is true (with respect to both indegree and outdegree) for certain sets of
World Wide Web pages which form a graph due to their hyperlinked structure
[?]; this result has been extended to the Web as a whole in [?].

Finally, a phenomenon related to heavy tails is the so-called Zipf’s Law [?].
Zipf’s Law relates the “popularity” of an object to its location in a list sorted
by popularity. More precisely, consider a set of objects (such as Web servers,
or Web pages) to which repeated references are made. Over some time interval,
count the number of references made to each object, denoted by $R$. Now sort
the objects in order of decreasing number of references made and let an object’s
place on this list be denoted by $n$. Then Zipf’s Law states that

$$R = c\,n^{-\beta}$$

for some positive constants $c$ and $\beta$. In its original formulation, Zipf’s Law set
$\beta = 1$ so that popularity ($R$) and rank ($n$) are inversely proportional. In practice,
various values of $\beta$ are found, with values often near to or less than 1. Evidence
for Zipf’s Law in computing systems (especially the Internet) is widespread
[2, 13, 18, 31]; a good overview of such results is presented in [?].
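
The ranking procedure just described can be sketched in a few lines: count references per object, sort by decreasing count, and estimate $\beta$ as the negative slope of $\log R$ against $\log n$. The least-squares fit and the synthetic reference stream are assumptions made for illustration; the text does not prescribe an estimation method.

```python
import math
from collections import Counter

def rank_frequency(references):
    """Return [(rank n, count R), ...] sorted by decreasing reference count."""
    counts = sorted(Counter(references).values(), reverse=True)
    return list(enumerate(counts, start=1))

def fit_zipf(rank_counts):
    """Least-squares fit of log R = log c - beta * log n; returns (c, beta)."""
    xs = [math.log(n) for n, _ in rank_counts]
    ys = [math.log(r) for _, r in rank_counts]
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope

# Synthetic (hypothetical) reference stream: object n is referenced about 1200/n times.
stream = [f"page{n}" for n in range(1, 51) for _ in range(round(1200 / n))]
c, beta = fit_zipf(rank_frequency(stream))
print(f"c = {c:.0f}, beta = {beta:.2f}")
```

On this synthetic stream the fitted exponent should come out close to the original formulation's $\beta = 1$; on measured traces the fit is usually noisier, especially at the highest ranks.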

3 Implications of Heavy Tails 

Unfortunately, although heavy-tailed distributions are prevalent and important 
in computer systems, their unusual nature presents a number of problems for 
performance analysis. 
The fact that even low-order distributional moments can be infinite means
that many traditional system metrics can be undefined. As a simple example,
consider the mean queue length in an M/G/1 queue, which (by the Pollaczek-Khinchin
formula) is proportional to the second moment of service time. Thus,
when service times are drawn from a heavy-tailed distribution, many properties
of this queue (mean queue length, mean waiting time) are infinite. Observations
like this one suggest that performance analysts dealing with heavy tails may need
to turn their attention away from means and variances and toward understanding
the full distribution of relevant metrics. Most early work in this direction has
focused on the shape of the tail of such distributions (e.g., [?]).
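
To make the M/G/1 example concrete, the sketch below evaluates the Pollaczek-Khinchin mean waiting time, $W = \lambda E[S^2] / (2(1 - \rho))$ with $\rho = \lambda E[S]$, for service times drawn from a Pareto density truncated to $[k, p]$. The truncation, the shape $\alpha = 1.5$, and the load $\rho = 0.5$ are assumptions chosen for illustration; letting the upper bound $p$ grow shows the mean wait diverging as the second moment does.

```python
import math

def bounded_pareto_moment(j, alpha, k, p):
    """E[X^j] for a Pareto density with shape alpha truncated to [k, p] (assumed model)."""
    norm = alpha * k ** alpha / (1.0 - (k / p) ** alpha)
    if math.isclose(j, alpha):
        return norm * math.log(p / k)
    return norm * (p ** (j - alpha) - k ** (j - alpha)) / (j - alpha)

def pk_mean_wait(lam, es, es2):
    """Pollaczek-Khinchin mean waiting time in queue for an M/G/1 queue."""
    rho = lam * es
    if rho >= 1.0:
        raise ValueError("queue is unstable (rho >= 1)")
    return lam * es2 / (2.0 * (1.0 - rho))

alpha, k, rho = 1.5, 1.0, 0.5
waits = []
for p in (1e2, 1e4, 1e6, 1e8):
    es = bounded_pareto_moment(1, alpha, k, p)
    es2 = bounded_pareto_moment(2, alpha, k, p)
    lam = rho / es                  # arrival rate that keeps the load fixed at rho
    waits.append(pk_mean_wait(lam, es, es2))
    print(f"p = {p:.0e}  mean wait = {waits[-1]:.1f}")
```

The load is the same in every row; only the largest possible service time changes, yet the mean wait keeps growing roughly like $\sqrt{p}$.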

Some heavy-tailed distributions apparently have no convenient closed-form
Laplace transforms (e.g., the Pareto distribution), and even for those distributions
possessing Laplace transforms, simple systems like the M/G/1 queue must
be evaluated numerically, and with considerable care [?].

In practice, random variables that follow heavy tailed distributions are characterized
as exhibiting many small observations mixed in with a few large observations.
In such datasets, most of the observations are small, but most of the
contribution to the sample mean or variance comes from the rare, large observations.
This means that those sample statistics that are defined converge very
slowly. This is particularly problematic for simulations involving heavy tails,
which may be very slow to reach steady state [?].

Finally, because arbitrarily large observations cannot be ruled out, issues of
scale should enter into any discussion of heavy tailed models. No real system
can experience arbitrarily large events, and generally one must pay attention
to the practical upper limit on event size, whether determined by the timescale
of interest, the constraints of storage or transmission capacity, or other
system-defined limits. On the brighter side, a useful result is that it is often reasonable
to substitute finitely-supported distributions for the idealized heavy-tailed
distributions in analytic settings, as long as the approximation is accurate over the
range of scales of interest [16, 20, 22].

4 Taking Advantage of Heavy Tails 

Despite the challenges they present to performance analysis, heavy tailed distri- 
butions also exhibit properties that can be exploited in the design of computer 
systems. Recent work has begun to explore how to take advantage of the presence 
of heavy tailed distributions to improve computer systems’ performance. 

4.1 Two Important Properties 

In this regard, there are two properties of heavy tailed distributions that offer 
particular leverage in the design of computer systems. The first property is re- 
lated to the fact that heavy tailed distributions show declining hazard rate, and 
is most concisely captured in terms of conditional expectation: 



$$E[X \mid X > k] \sim k$$

when $X$ is a heavy tailed random variable and $k$ is large enough to be “in the
tail.” We refer to this as the expectation paradox, after [30, p. 343]; it says that
if we are making observations of heavy-tailed interarrivals, then the longer we
have waited, the longer we should expect to wait. (The expectation is undefined
when $\alpha \le 1$, but the general idea still holds.) This should be contrasted with
the case when the underlying distribution has exponential tails or has bounded
support above (as in the uniform distribution); in these cases, eventually one
always gets to the point where the longer one waits, the less time one should
expect to continue waiting.
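
The contrast can be checked with closed-form expected remaining waits, $E[X - t \mid X > t]$, for three standard cases. For a Pareto variable with $\alpha > 1$ this quantity is $t/(\alpha - 1)$, so it grows in proportion to the time already waited; for the exactly Exponential case it stays constant at $1/\lambda$; for a uniform variable on $[0, b]$ it shrinks as $(b - t)/2$. The function names and example values below are illustrative assumptions.

```python
def pareto_remaining(t, alpha, k=1.0):
    """E[X - t | X > t] for a Pareto(alpha, k) variable, valid for alpha > 1 and t >= k."""
    if alpha <= 1.0:
        raise ValueError("the mean is infinite for alpha <= 1")
    if t < k:
        raise ValueError("t must be at least the minimum value k")
    return t / (alpha - 1.0)

def exponential_remaining(t, rate):
    """E[X - t | X > t] for an Exponential(rate) variable: memoryless, so constant."""
    return 1.0 / rate

def uniform_remaining(t, upper):
    """E[X - t | X > t] for a Uniform(0, upper) variable, valid for 0 <= t < upper."""
    if not 0.0 <= t < upper:
        raise ValueError("t must lie in [0, upper)")
    return (upper - t) / 2.0

for t in (1.0, 10.0, 100.0):
    print(t, pareto_remaining(t, alpha=1.5), exponential_remaining(t, rate=0.5))
```

With $\alpha = 1.5$, having already waited 10 time units implies an expected further wait of 20, and having waited 100 implies a further 200.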

The second useful property of heavy tailed distributions we will call the
mass-count disparity. This property can be stated formally as [?]:

$$\lim_{x \to \infty} \frac{P[X_1 + \cdots + X_n > x]}{P[\max(X_1, \ldots, X_n) > x]} = 1 \qquad \text{for all } n \ge 2$$

which is the case when the $X_i$ are i.i.d. positive random variables drawn from a
heavy-tailed distribution. This property states that when considering collections
of observations of a heavy-tailed random variable, the aggregated mass contained
in the small observations is negligible compared to the largest observation in
determining the likelihood of large values of the sum.

In practice this means that the majority of the mass in a set of observations 
is concentrated in a very small subset of the observations. This can be visualized 
as a box into which one has put a few boulders, and then filled the rest of the 
way with sand. This mass-count disparity means that one must be careful in 
“optimizing the common case” [26]. The typical observation is small; the typical
unit of work is contained in a large observation. 

This disparity can be studied by defining the mass-weighted distribution
function:

$$F_w(x) = \frac{\int_{-\infty}^{x} u \, dF(u)}{\int_{-\infty}^{\infty} v \, dF(v)} \tag{2}$$



and comparing $F_w(x)$ with $F(x)$. Varying $x$ over its valid range yields a plot of
the fraction of total mass that is contained in the fraction of observations less
than $x$. An example of this comparison is shown in Figure 1. This figure shows
$F_w(x)$ vs. $F(x)$ for the Exponential distribution, and for a particular heavy-tailed
distribution. The heavy-tailed distribution is chosen to correspond to empirical
measurements of file sizes in the World Wide Web [1]; it has $\alpha = 1.0$. Since
the denominator in (2) is infinite for heavy tailed distributions with $\alpha \le 1$, the
actual distribution used has been truncated to span six orders of magnitude,
which is reasonable for file size distributions (which can range in size from bytes
to megabytes).

The figure shows that for the Exponential distribution, the amount of mass
contained in small observations is roughly commensurate with the fraction of
total observations considered; i.e., the curve is not too far from the line y = x.
On the other hand, for the heavy tailed distribution, the amount of mass is not
at all commensurate with the fraction of observations considered; about 60% of
the mass is contained in the upper 1% of the observations! This is consistent
with results in [?] showing that 50-80% of the bytes in FTP transfers are due
to the largest 2% of all transfers.

Fig. 1. Total Mass as a Function of Smallest Observations
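
The comparison of the mass-weighted distribution with the ordinary distribution function can be reproduced in closed form under a simplifying assumption: a Pareto density with $\alpha = 1$ truncated to $[k, 10^6 k]$ stands in for the empirically fitted file-size distribution, which is not reproduced here. Under that assumption $F(x) = (1/k - 1/x)/(1/k - 1/p)$ and the mass-weighted distribution is $\ln(x/k)/\ln(p/k)$. The function names are illustrative.

```python
import math

def bp1_cdf(x, k, p):
    """F(x) for a Pareto density with alpha = 1 truncated to [k, p]."""
    return (1.0 / k - 1.0 / x) / (1.0 / k - 1.0 / p)

def bp1_quantile(q, k, p):
    """Inverse of bp1_cdf: the size below which a fraction q of observations fall."""
    return 1.0 / (1.0 / k - q * (1.0 / k - 1.0 / p))

def bp1_mass_cdf(x, k, p):
    """Mass-weighted distribution: share of total mass in observations of size <= x."""
    return math.log(x / k) / math.log(p / k)

def exp_mass_cdf(x):
    """Mass-weighted distribution for the Exponential distribution with mean 1."""
    return 1.0 - (1.0 + x) * math.exp(-x)

k, p = 1.0, 1e6                         # six orders of magnitude, as in the text
x99_heavy = bp1_quantile(0.99, k, p)
heavy_top_share = 1.0 - bp1_mass_cdf(x99_heavy, k, p)

x99_exp = math.log(100.0)               # 99th percentile of Exponential(1)
exp_top_share = 1.0 - exp_mass_cdf(x99_exp)

print(f"mass in largest 1%: heavy-tailed {heavy_top_share:.2f}, exponential {exp_top_share:.3f}")
```

Under these assumptions the largest 1% of observations carry roughly two-thirds of the mass in the truncated heavy-tailed case, the same order as the figure's value, against only a few percent for the Exponential.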

4.2 Exploiting the Heavy Tail Properties 

Once these properties are understood, they can be exploited in a number of ways 
to improve system performance. This section summarizes some (though not all) 
recent attempts to do this. 

Load Balancing in Distributed Systems. In some distributed systems, tasks can
be pre-empted and moved from one node to another, which can improve load
balance. However, the cost of migration is not trivial and can outweigh performance
gains from improved load balance if not used carefully. In [23], the authors
show that previous assessments of the potential for pre-emptive migration had
mainly used exponential task size assumptions and concluded that the potential
gains from task migration were small. However, once the task size distribution
is understood to be heavy-tailed, two benefits emerge: 1) the mass-count disparity
means that relatively few tasks need to be migrated to radically improve
load balance; and 2) the expectation paradox means that a task’s lifetime to
date is a good predictor of its expected future lifetime. Taken together, these
two benefits form the foundation for an enlightened load balancing policy that
can significantly improve the performance of a wide class of distributed systems.

When pre-emption is not an option, understanding of heavy tailed distributions
can still inform load balancing policies. The question in these systems is
“which queue should an arriving task join?” In the case when service at the nodes
is FCFS, and knowledge is available about the size of the arriving task, the best
policy is commonly assumed to be joining the queue with the shortest expected
delay [?], although this is known to be best only for task size distributions with
increasing failure rate. In [23], the authors show a better policy for the case in
which task sizes have a heavy-tailed distribution, which they call SITA-E. The
idea is to assign an incoming task to a queue based on the incoming task’s size.
Each queue handles tasks whose sizes lie in a contiguous range, and ranges are
chosen so as to equalize load in expectation. This policy is shown to significantly
outperform shortest-expected-delay assignment when $1 < \alpha < 2$. The benefits
of the policy accrue primarily from the mass-count disparity in task sizes:
grouping like tasks together means that the vast majority of tasks are sent to
only a few queues; at these queues, task size variability is dramatically reduced
and so FCFS service is very efficient.
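
The size-range idea can be sketched as follows, assuming task sizes follow a Pareto density truncated to $[k, p]$ with $\alpha \ne 1$; the truncation, the parameter values, and all function names are assumptions for illustration rather than the cited authors' implementation. The share of expected work carried by tasks in $[a, b]$ is $(b^{1-\alpha} - a^{1-\alpha})/(p^{1-\alpha} - k^{1-\alpha})$, so cutoffs that equalize load have a closed form.

```python
def bp_cdf(x, alpha, k, p):
    """Fraction of tasks with size <= x under a Pareto density truncated to [k, p]."""
    return (1.0 - (k / x) ** alpha) / (1.0 - (k / p) ** alpha)

def work_share(a, b, alpha, k, p):
    """Fraction of total expected work carried by tasks with size in [a, b]."""
    e = 1.0 - alpha
    return (b ** e - a ** e) / (p ** e - k ** e)

def equal_load_cutoffs(hosts, alpha, k, p):
    """Size cutoffs giving each of `hosts` queues an equal share of expected work."""
    e = 1.0 - alpha                     # requires alpha != 1
    lo, hi = k ** e, p ** e
    return [(lo + (i / hosts) * (hi - lo)) ** (1.0 / e) for i in range(1, hosts)]

def assign_queue(size, cutoffs):
    """Index of the queue whose contiguous size range contains `size`."""
    return sum(1 for c in cutoffs if size > c)

alpha, k, p = 1.5, 1.0, 1e6             # assumed example parameters
cutoffs = equal_load_cutoffs(2, alpha, k, p)
small_task_fraction = bp_cdf(cutoffs[0], alpha, k, p)
print(f"cutoff = {cutoffs[0]:.2f}, tasks sent to queue 0 = {small_task_fraction:.1%}")
```

With two queues and these parameters the cutoff falls at about four times the minimum task size: the great majority of tasks go to the first queue, while the second queue serves the few large tasks that make up the other half of the work.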

Finally, in another paper [?], the authors show that in the same setting
(distributed system of FCFS servers, task sizes are heavy tailed, and incoming 
task sizes are known) the expected slowdown metric is optimized by policies that 
do not balance load. (Slowdown is defined as a job’s waiting time in queue divided 
by its service demand.) This is possible because of the mass-count disparity; when 
most tasks are sent to only a few queues, reducing the load at those queues 
decreases the slowdown experienced at those queues. In this case, most tasks 
experience decreased slowdown, while the relatively few large tasks experience 
only slightly increased slowdown. In expectation, slowdown is decreased. 

Scheduling in Web Servers. In single-node systems, attention has been given to
the scheduling issue. Most systems use a variant of timesharing to schedule tasks,
possibly incorporating multilevel feedback; this is effective when task sizes are
unknown. In [22], the authors argue that Web servers are in an unusual position:
they can estimate task size upon task arrival because, for static Web pages,
the file size is known at request time. As a result, they argue for the use of
shortest-remaining-processing-time (SRPT) scheduling within Web servers. One
significant drawback of SRPT is that it improves the response time of small tasks
at the expense of large tasks; however, the authors argue that this is acceptable
when tasks follow heavy-tailed distributions such as are encountered in the Web.
The reason is that the mass-count disparity means that under SRPT, although
large tasks are interrupted by small tasks, the small tasks represent only a minor
fraction of total system load. Thus the great majority of tasks have their response
time improved, while the relatively few large tasks are not seriously punished. In
[?] the authors describe an actual Web server implemented to use this scheduling
policy. The paper shows evidence that the new server exhibits mean response
times 4-5 times lower than a popularly deployed server (Apache); and that the
performance impacts on large tasks are relatively mild.
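
The effect of SRPT on small and large tasks can be seen in an idealized single-server simulation. This is only a sketch under stated assumptions: task sizes are known exactly on arrival, pre-emption is free, and the three-job workload is made up for illustration; it is not the Web server implementation described above.

```python
import heapq

def srpt_response_times(jobs):
    """jobs: list of (arrival_time, size). Returns each job's response time, in input order."""
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    ready = []                          # min-heap of (remaining_work, job_index)
    response = [0.0] * len(jobs)
    now, nxt = 0.0, 0
    while nxt < len(order) or ready:
        if not ready:                   # server idle: jump ahead to the next arrival
            now = max(now, jobs[order[nxt]][0])
        while nxt < len(order) and jobs[order[nxt]][0] <= now:
            i = order[nxt]              # admit every job that has arrived by `now`
            heapq.heappush(ready, (jobs[i][1], i))
            nxt += 1
        remaining, i = heapq.heappop(ready)
        next_arrival = jobs[order[nxt]][0] if nxt < len(order) else float("inf")
        if remaining <= next_arrival - now:
            now += remaining            # job finishes before anything else arrives
            response[i] = now - jobs[i][0]
        else:                           # run until the arrival, then reconsider
            heapq.heappush(ready, (remaining - (next_arrival - now), i))
            now = next_arrival
    return response

def fcfs_response_times(jobs):
    """Response times under first-come-first-served, for comparison."""
    response = [0.0] * len(jobs)
    now = 0.0
    for i in sorted(range(len(jobs)), key=lambda i: jobs[i][0]):
        now = max(now, jobs[i][0]) + jobs[i][1]
        response[i] = now - jobs[i][0]
    return response

# Hypothetical workload: one large task followed by two small ones.
jobs = [(0.0, 10.0), (1.0, 1.0), (2.0, 1.0)]
print("SRPT:", srpt_response_times(jobs))
print("FCFS:", fcfs_response_times(jobs))
```

In this toy workload the two small tasks finish in one time unit each under SRPT instead of ten under FCFS, while the large task's response time rises only from 10 to 12.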

Routing and Switching in the Internet. In Internet traffic management, a number
of improved approaches to routing and switching have been proposed, based on
the observation that the lengths of bulk data flows in the Internet exhibit heavy
tails.

One promising routing technique is to use switching hardware, by creating
shortcuts (temporary circuits) for long sequences of packets that share a common
source and destination. Shortcuts provide the benefits of fast switch-based routing,
at the expense of network and switch overhead for their setup. The authors
in [?] argue that Web traffic can be efficiently routed using this technique. Their
results rely on the mass-count disparity, showing that the majority of the bytes
can be routed by creating shortcuts for only a small fraction of all data flows.
They show that in some settings, a setup threshold of 25 packets (the number
of same-path packets to observe before creating a switched connection) is sufficient
to eliminate 90% of the setup costs while routing more than 50% of the
bytes over switched circuits. The choice of threshold implicitly makes use of the
expectation paradox: longer thresholds can be used to offset larger setup costs,
since longer thresholds identify flows whose expected future length is longer as
well.
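
The trade-off controlled by the setup threshold can be sketched with a small calculation. The assumptions are loud ones: flows are measured in packets rather than bytes, a shortcut is created as soon as a flow exceeds the threshold, only packets after that point are switched, and the flow-length mix is invented for illustration.

```python
def shortcut_stats(flow_lengths, threshold):
    """Return (fraction of flows that trigger a setup, fraction of packets switched).

    flow_lengths: number of packets in each flow.
    threshold: same-path packets observed before a shortcut is created.
    """
    total_packets = sum(flow_lengths)
    setups = sum(1 for n in flow_lengths if n > threshold)
    switched = sum(n - threshold for n in flow_lengths if n > threshold)
    return setups / len(flow_lengths), switched / total_packets

# Hypothetical mix: many short flows, a few medium ones, one very long one.
flows = [5] * 90 + [100] * 9 + [10_000]
setup_fraction, switched_fraction = shortcut_stats(flows, threshold=25)
print(f"flows needing setup: {setup_fraction:.0%}, packets switched: {switched_fraction:.0%}")
```

In this made-up mix, nine flows in ten never reach the threshold and so incur no setup cost, yet most packets still travel over shortcuts because the few long flows carry most of the traffic.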

Another proposed routing technique is load-sensitive routing. Load-sensitive
routing attempts to route traffic around points of congestion in the network;
current Internet routing only makes use of link state (up or down). Unfortunately,
load-sensitive routing can be expensive and potentially unstable if applied to
every routing decision. However, the authors in [?] show that if applied only
to the long-lived flows, it can be efficient and considerably more stable. The 
success of this technique relies on the heavy tailed distribution of Internet flows: 
the mass-count disparity means that a large fraction of bytes can be routed by 
rerouting only a small fraction of the flows; and the expectation paradox allows 
the policy to observe a flow for some period of time to classify it as a long flow. 
Abstract. Understanding the interactions between hardware and software
is important to performance in many systems found in data communications,
such as routers. Responsibilities that traditionally were programmed
in software are being transferred to intelligent devices and
special-purpose hardware. With more functionality being transferred to
these devices, it becomes increasingly important to capture them in performance
models. Modeling hardware/software systems requires an extended
queueing model like LQN. This paper describes a layered architecture
model which represents hardware and software uniformly and
which emphasizes resources and performance, called a Resource-based
Model Architecture (RMA). The approach is demonstrated on a remote
access or LAN extension router. The model is created by a systematic
tracing of scenarios and is used to explore the router capacity for different
workloads, and to analyze a re-design for scaleup.



1 Introduction 

Interactions between hardware and software resources are important to the performance
of many systems, for example in data communications, where logical
operations are partitioned between software tasks and intelligent interface devices
or processors. One difficulty in modeling these systems is that they are not well
represented by ordinary queueing models, and require an extended queueing
model. Layered Queueing [?] provides a systematic framework which simplifies
the model construction and solution.

This paper describes a “Resource-based Modeling Architecture” (RMA) for 
modeling this class of system. It demonstrates the approach with a study of a 
small LAN Extension Router (LAN/ER), in which the model is used to estimate 
performance and evaluate design trade-offs. 

The model is created in two stages. First the layered Resource-based Model 
Architecture (RMA) is created by inspecting the hardware and software and 
tracing the involvement of components in scenarios. Then the parameters of the 
architecture are determined by various means, including profiling. The architec- 
ture itself gives guidance as to what to collect and how to combine the parameter 
information. The model is completed by adding environmental information. 
The RMA model is an abstraction that shows how the software and hardware
together create delays and bottlenecks under different workloads. The model is
used to predict performance characteristics like capacity and delay, using either
analytic or simulation techniques [17]; in this case simulation was used.

The first purpose of the model is evaluation of the capability of the given
design. The second purpose is to expose performance bugs, by establishing expectations
for performance tests. The third purpose is to provide insight to guide
the evolution of the design. The special value of the layered structure of RMA is
that it conceptually organizes a complicated flat assembly of queues and functions,
according to how they compete for resources. This gives insight into the
effects that software and hardware have on each other.

2 Application of RMA to a LAN Extension Router 

Routers make extreme demands on the performance of software and hardware,
to execute communication protocols and direct the traffic to the correct destination
network. To increase the performance of routers, research has been done in
improving various aspects of the software and hardware. Examples of software
research include router table lookups to significantly speed up routing
decisions, heuristic techniques for implementation of protocols [?], general
ways of handling packets in routers efficiently [4], and comparisons of queueing
strategies [?]. Hardware research has proposed different architectures to deal
with the increasing demands [?]. Each of these studies is
focused on using a single aspect of router design to enhance performance. However,
the overall performance comes from the interaction of many factors, and
RMA provides an approach to study the overall combination.




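[Figure: the LAN Extension Router sits between the LAN (10 Mbits/sec) and the ISDN Network, which connects the Remote User(s) over B Channels (64 Kbits/sec). Outbound Traffic and Inbound Traffic are marked as the two directions through the router.]

Fig. 1. High level diagram of the LAN/ER’s major input and output ports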



The router studied here and shown in Figure 1 is a small device, not neces- 
sarily at the cutting edge of technology, but it nicely illustrates the use of the 
modeling approach. It extends the LAN on the left so that the remote users on 
the right (which are connected to the router by ISDN links) operate as if they 
were connected to the LAN directly. The users have a Basic Rate Interface ISDN 
service, which can consist of one or two B channels. The LAN/ER is capable of 
handling up to 120 B channel connections at one time. 
The LAN/ER is also connected to a 10 Mbits/sec LAN which uses
a standard Ethernet protocol. Frames being received on the ISDN network side
can be bridged to the LAN, or be routed back to another B channel on the
ISDN network side. Conversely, packets arriving on the LAN will be routed to
the appropriate B channel. Each B channel uses a synchronous Point-to-Point
Protocol (PPP) at the network layer, and an HDLC protocol at the data link
layer.

2.1 A More Detailed Description of the LAN Extension Router 

The LAN/ER has a single bus, single processor architecture as shown in Figure 2. 



[Figure: hardware block diagram, with the LAN connection on one side and the B Channels (ISDN network) on the other.]

Fig. 2. Hardware interconnection in the LAN/ER system



There are four ISDN interface chips, each supporting 32 B channels, which 
provide a full duplex ISDN interface and some of the processing for the HDLC 
protocol. Frames which are received and frames to be transmitted are placed in 
the ISDN buffers in main memory. When the ISDN chip is not busy it polls the 
transmit buffers in main memory every 125 µs to check if any new frames to
transmit have appeared. The FIFO buffers within the ISDN chip hold 8 bytes 
for each direction (receive and transmit) for each B channel. To empty its receive 
FIFO buffer, or fill its FIFO transmit buffer, the chip makes a request for the 
bus and executes a DMA operation on the main memory. 
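
A back-of-the-envelope calculation shows what these figures imply for bus traffic. It is a sketch under assumptions that go beyond the text: every one of the 120 B channels is saturated in both directions, and each DMA operation moves one full 8-byte FIFO. The constant names are illustrative.

```python
B_CHANNEL_BITS_PER_SEC = 64_000    # B channel rate shown in Figure 1
FIFO_BYTES = 8                     # per B channel, per direction
POLL_INTERVAL_US = 125             # ISDN chip polling interval
MAX_CHANNELS = 120                 # maximum simultaneous B channel connections

bytes_per_sec = B_CHANNEL_BITS_PER_SEC // 8
byte_time_us = 1_000_000 // bytes_per_sec              # time to move one byte on a B channel
fifo_fill_time_us = FIFO_BYTES * byte_time_us          # time for a saturated channel to fill a FIFO

# Assumption: one DMA operation per full FIFO, per direction, per channel.
dma_per_sec_per_direction = bytes_per_sec // FIFO_BYTES
peak_dma_per_sec = dma_per_sec_per_direction * 2 * MAX_CHANNELS

print(f"byte time: {byte_time_us} us, FIFO fill time: {fifo_fill_time_us} us")
print(f"peak DMA operations per second: {peak_dma_per_sec}")
```

Under these assumptions the 125 µs polling interval equals the time one byte takes on a 64 Kbits/sec channel, a FIFO fills in about a millisecond, and the ISDN side alone can ask for the bus a few hundred thousand times per second at full load.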

The LAN chip supports the 10 Mbits/sec IEEE 802.3 Ethernet protocol. Its
operations are similar to those of the ISDN chip, except that when it is idle it does not
poll, but instead waits for commands from a queue in main memory which tell
the chip what operations to perform. The LAN chip has an on-board FIFO buffer
of 32 bytes for each of the receive and transmit directions.

The CPU executes a Worker task to process frames found in the ISDN buffers 
and transfer them into the LAN buffers, and vice versa. It runs a real-time kernel 
that manages the Worker task and ten other tasks which do the management 
and maintenance of the system. Their CPU utilization is much smaller than that 
of the Worker task. Access to the main memory with all the data buffers as well 
as program storage is controlled by the bus arbiter. It uses a pre-emptive “hold 
till done” priority scheme. The priority order, from low to high is CPU, LAN 
chip, ISDN chip. The DMA devices can pre-empt the CPU in use of the bus, 
but once any DMA device has the bus, it is not pre-emptable. ISDN requests 
for the bus are blocked after 4 requests from 4 different ISDN chips, to give fair 
bus access to the LAN chip. 

The freehand lines running through Figure 3 trace four important scenarios 
using the Use Case Map notation [3]. Each scenario starts from a filled circle,
representing a triggering event, and shows the order in which components are 
involved in completing it. 



Outbound Data Scenario (LAN to ISDN) and Acknowledgment. Beginning
from the filled circle at the top of Figure 3, a packet is first received by
the LAN chip. Incoming bytes are stored in the on-chip FIFO buffer and then
transferred to the LAN RX buffers. Eventually the Worker task will check the
LAN RX buffers for complete received packets, and process them. First it determines
which B channel the packet is destined for, by calling the bridge/routing
functions. Then the packet is queued to its appropriate B channel queue (ISDN
queue). When the B channel is able to accept data, the HDLC portion of the PPP
stack (service_HDLC_TX) processes the packet to a frame. The frame is converted
by the process_ISDN_TX_buffers operation into a format that the ISDN
chip can understand and is stored in the ISDN TX buffers. The ISDN chip
polls these buffers, and when it finds a new frame ready to transmit, it DMAs
the frame 8 bytes at a time into its appropriate B channel buffer. Depending
on the sliding flow control window of HDLC on the link, an acknowledgment
frame will be received eventually. Following the dark path up from the bottom,
it is processed the same way as an inbound frame to the point when it reaches
the HDLC_service_RX function, where the acknowledgment is processed and the
frame is discarded.



Inbound Data Scenario (ISDN to LAN) and Acknowledgment. The 
ISDN to LAN scenario starts when a frame arrives from a remote user, on 
a B channel. The incoming bytes are stored by the ISDN chip into a FIFO 
buffer for that B channel. As the FIFO buffer fills up, the chip will transfer 
its contents into one of the ISDN RX buffers in the main memory. When the 
frame is complete, the Worker task will eventually assemble the frame for that 
B channel through the process_ISDN_RX_buffers function. The frame is then 
sent to the service_HDLC_RX function, which processes the packet and updates 
any state information. Depending on the sliding window of HDLC on the link, 
an acknowledgment may be sent to the remote user for the successful reception 
of frame(s). The frame is then sent to the upper layer of the PPP stack for 
bridging/routing, and is deposited into the LAN transmit queue. The frame is 
taken from the LAN queue and moved into the LAN chip transmit buffers, and 
the LAN chip is instructed by the Worker task to send the packet off to the 
LAN. 




Fig. 3. Use Case Map of four scenarios 

Worker Task Operation. The Worker task polls the various queues and 
buffers shown in Figure 3, using a polling sequence described by pseudo-code 
in Figure 4. Each service function is repeated until the given buffer is empty. 



while (TRUE) 
{ 
    for (x = 1; x <= number of active channels; x++) 
    { 
        hdlc_service_RX(x); 
        hdlc_service_TX(x); 

        process_ISDN_RX_buffers(x); 
        process_ISDN_TX_buffers(x); 
    } 

    service_LAN_TX(); 
    service_LAN_RX(); 

    sleep until awakened by Null task or timer tick 
} 



Fig. 4. High level pseudo code describing the Worker task polling order 
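
The polling order can also be sketched as runnable code. The following is a minimal Python sketch of a single poll cycle, not the router's actual firmware: the buffer names (`hdlc_rx`, `hdlc_tx`, `isdn_rx`, `isdn_tx`), the `drain` helper and the `handler` callback are hypothetical, the LAN service calls are assumed to sit outside the per-channel loop, and the sleep step is omitted.

```python
from collections import deque


def drain(buffer, handler):
    """Service one buffer until it is empty; return the number of items handled."""
    handled = 0
    while buffer:
        handler(buffer.popleft())
        handled += 1
    return handled


def poll_cycle(channels, lan_tx, lan_rx, handler):
    """One pass of the Worker polling order (hypothetical data layout).

    channels: list of dicts, one per active B channel, each mapping a
              buffer name to a deque of pending items.
    """
    handled = 0
    for ch in channels:
        # Same order as the pseudo code: HDLC RX, HDLC TX, ISDN RX, ISDN TX.
        for name in ("hdlc_rx", "hdlc_tx", "isdn_rx", "isdn_tx"):
            handled += drain(ch[name], handler)
    handled += drain(lan_tx, handler)  # corresponds to service_LAN_TX()
    handled += drain(lan_rx, handler)  # corresponds to service_LAN_RX()
    return handled
```

Because every buffer is drained completely before the next one is visited, the time for one cycle grows with the amount of queued work, which is why the per-entry service demands and the cost of empty polls are measured separately later in the paper.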



3 Derivation of the Performance Model 

Once the scenarios have been defined, as they have been in Section 2, the layered 
model is developed in four steps: 

1. Obtain sub-models by tracing out scenarios and grouping responsibilities to 
resources. 

2. Merge sub-models to get a complete model of the system identifying all the 
entities. 

3. Add any missing operational detail. 

4. Optionally, simplify the model. 

3.1 Step 1: Obtaining Sub-models from Scenarios 

In the first step, the system entities which are also resources are identified. These 
are the separate devices (LAN and ISDN interface chips, CPU, bus, memory) and 
the software processes (in this case, only the worker task to begin with). In the 
layered modelling notation of |S] these resource entities are termed “tasks” and 
the different operations they perform are termed “entries” . Along the scenario, 
operations are grouped by task to define the work of each entry. This gives the 
resource layer in Figure 5. There is one entry per task, in this scenario. Resources 
provisioned in multiple copies (such as the four ISDN chips in Figure 5) are shown 
as a stack of task shapes. 

Additional tasks in the model can be used to define the generation of work- 
load (e.g. the LAN driver task) and the sequencing of operations by the resource 
tasks (e.g. the LAN_op task). These are pseudo-tasks in that they do no work, 
consume no CPU time and do not represent system entities as such. The N LAN 
driver tasks represent N users each generating a stream of input packets. The 
LAN_op tasks call the resource operations in turn, with synchronous interac- 
tions (shown by the filled arrowheads), so that the LAN_op task waits for each 
resource task to finish before invoking the next one. The response time for one 
instance of the operation task is the total delay to traverse the router, which is 
reported later as the basic performance measure. 

The second part of Step 1 is to add lower layers for resources and operations 
which are used by the operations taken from the scenario. These may be divided 
into software layers (which can include the firmware in the hardware devices) 
and hardware layers which show the physical devices. 

Figure 6 shows the result. The ISDN and LAN chips are divided into a 
software part and a hardware part, and the CPU is shown as being used by the 
Worker task. Then all three devices in the first hardware layer contend for the 
bus via the arbiter, and when they have the bus they access memory. Notice 
that in this design all the interactions are via memory and polling. 




Fig. 5. Step 1 (first part): Grouping of responsibilities to resources for LAN to 
ISDN scenario 



Step 1 has to be applied to each scenario separately before proceeding to Step 
2 (the other scenarios are not described in detail). 

3.2 Step 2: Merge Sub-Models to Create a Single Model of the System 

Each scenario gives a sub-model. In Step 2 these are merged by identifying the 
same task used in different scenarios, as in Figure 7. There is a driver task and 

[Figure 6 layer labels, top to bottom: Driver Layer, Operation Layer, Software Layer, 
First Hardware Layer, Second Hardware Layer, Third Hardware Layer] 



Fig. 6. Step 1 (second part): Expanded submodel which includes all resources 
for LAN to ISDN scenario 



operation task for each scenario. The tasks in the software layer and lower layers 
have separate entries for the operations in each scenario, which are treated in the 
model as different classes served by the resource. However, the separate entries 
are not shown in this figure, to avoid clutter. 

3.3 Step 3: Adding Behavioral Detail to Model 

Steps 1 and 2 have captured the behavior defined by the Use Case Maps found 
in Figure 3. This step will now add the behavioral detail which was ignored in 
the Use Case Maps. 

Maintenance Tasks. Section 2 mentioned other (maintenance) tasks which run 
on the system when the Worker task is idle. The model 
represents them by a single aggregate Maintenance task which is triggered once 
in each poll cycle by an entry Sleep in the Worker task. 



Main Bus Arbitration. The arbitration mechanism for the main bus, as described 
in Section 2, has priorities with some pre-emption. This is approximated 
by a pre-emptive priority scheduling discipline, which is supported by the layered 
modeling tools. In the model, when the ISDN, CPU and LAN tasks make requests to the 

Fig. 7. Layered model covering all four scenarios 



bus task, their requests are handled by priority. The ISDN and LAN bus tasks 
are given equal priority because they cannot pre-empt each other, but the CPU 
bus task has lower priority because it can be pre-empted at any time. The prior- 
ity is indicated in Figure 8 by a number at the lower left corner of ISDN, LAN, 
and CPU tasks in the first hardware layer. 



Polling Overhead. The ISDN chips poll the ISDN TX buffers every 125 µs. 
The overhead for polling is included in the chip operations, but for empty polls 
(when there is no data to be processed) the overhead is added separately by the 
ISDN_Poll task in Figure 8. Every 125 µs it requests an average of MP polls, 
for M chips, using the probability P of an empty poll given by 

$$P = 1 - \frac{1}{f} V \lambda$$ 

where: 

f is the maximum transmission speed of a B channel, being 8000 bytes/sec; 

V is the average size of the frame to be transmitted; 

λ is the number of frames per second to be transmitted on a B channel. 
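
As a worked illustration, the empty-poll overhead can be computed directly. This is a minimal sketch that reads the formula as P = 1 - Vλ/f; the function names and the clamping of P to the range [0, 1] are assumptions added for the example, not part of the original model.

```python
B_CHANNEL_BYTES_PER_SEC = 8000.0  # f: maximum B channel speed, from the text


def empty_poll_probability(frame_bytes, frames_per_sec, f=B_CHANNEL_BYTES_PER_SEC):
    """Probability that a poll finds no frame to send: P = 1 - V * lambda / f.

    The result is clamped to [0, 1] (an assumption) so that an overloaded
    channel yields P = 0 rather than a negative value.
    """
    utilization = frame_bytes * frames_per_sec / f
    return min(1.0, max(0.0, 1.0 - utilization))


def empty_polls_per_interval(num_chips, frame_bytes, frames_per_sec):
    """Average number of empty polls requested per 125 microsecond interval: M * P."""
    return num_chips * empty_poll_probability(frame_bytes, frames_per_sec)
```

For example, with 500-byte frames sent at 14.4 frames per second the channel is 90% utilized, so P is about 0.1, and four chips generate about 0.4 empty polls per polling interval.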

The Worker task also polls, and the overhead for its empty polls is represented 
by the Sleep task which is triggered once per poll cycle. 



Asynchronous Operations. DMA operations initiated by the worker task are 
carried out asynchronously and are triggered in the model by an asynchronous 
signal indicated by an open arrowhead in the Figure. Requests to the operation 
tasks are also made asynchronous, since the source is not blocked while the 
router operates on a packet or frame. Since the model assumes that every packet 
is processed, the operation tasks are all given an infinite multiplicity. 

3.4 Step 4: Simplifying the Model 

A final optional step is to perform any simplifications on the model which will 
not affect the results. A motivation for doing this is to reduce solution/simulation 
time of the model, by reducing the number of separate entities in the model. The 
final model is given in Figure 8. The Main Bus and Memory entities have been 
merged since the bus does not contend for memory. Similarly the LAN firmware 
does not contend for the LAN chip device and has been merged with the device 
entity. The ISDN chip logic has been kept separate because it has the polling 
logic. 




Fig. 8. Complete model with Asynchronous Calls, Software Detail and Hardware 
Simplification (Merged Layers) 



4 Model Parameters and Validation 

4.1 Parameter Gathering 

Execution times of code were measured with a profiling tool called CodeTEST 
|T], which has a pod that piggybacks the pins on the processor chip. The software 
is instrumented by calls inserted in the code with probe IDs, which are also put 
into a database along with the location of the probe (that is, the source file and 
line number). 

In this way, software service demands for the Worker task entries HD_RX, 
HD_TX, TX_BUF, RX_BUF, LAN_TX, LAN_RX, and MAINT were obtained 
for packet and frame sizes of 64, 128, 256, 500, 1000, and 1500 bytes. Polling 
overhead (the cost of an empty poll) for the entries above was also measured using the 
CodeTEST tool and put into the model. These measurements were done under 
a moderate load with 30 active channels. 

The parameters for the interface chip operations were obtained from data 
books. This gave the time for the ISDN and LAN chips to read and write data 
to main memory, and in the case of the ISDN chip, also provided the time for 
one poll to the ISDN transmit buffers. 



4.2 Model Validation Experiments 

Validation was done by comparing predicted results from the model with measured 
delays across the router, for both the outbound (LAN to ISDN) and inbound 
(ISDN to LAN) scenarios. Packet sizes of 128, 500 and 1500 bytes were used, 
and the number of active channels was varied from 10 to 120 in steps of 10. Two 
routers were configured as a test environment for a third one, which served as 
the system under test, and was instrumented for total delay. The traffic direction 
(inbound, outbound) is described relative to this third router. 



4.3 Model Validation Results 

Figures 9 and 10 show the results of the end to end delay measurements. The 
delays are given for the outbound traffic, except for the two inbound traffic 
experiments in Figure 9. In all cases the number of active channels was increased 
up to the point where the router dropped 1% of the packets. This number of 
active channels was taken to be the router capacity for that traffic profile. The 
number of active channels can only go as high as 120 because this was a limitation 
of the LAN/ER. 

Table 1 gives a summary of the model’s accuracy for a wide variety of ex- 
periments. The runs are described by the traffic direction and packet length. In 
each experiment the number of active channels was varied, and the table reports 
the results for the largest number of active channels that gave acceptable model 
error (which was taken as 12%). In the experiment there was an upward trend in 
the error in all cases as the number of active channels increased. Thus the first 
row of Table 1, corresponding to the lowest line on the graph in Figure 9, shows 
that the accuracy of the delay predictions is within 10% over the entire range of 
active channels. The same is true for the last line in Table 1, and is true for all 
but the last point in the curve for the other lines except for rows 2, 3 and 4. The 
experiments of rows 2, 3 and 4 had high CPU utilization and system congestion 
at the top ends of their ranges. Row 3 shows saturation for a relatively small 

Fig. 9. Uni-directional measurement results for X to Y delay, with every B 
channel 90% utilized 



number of channels because each channel carries a lot of short packets, and the 
CPU effort is dominated by a fixed amount per packet. 

The source of the errors appears to be oversimplification in the modeling 
of the bus. When the CPU is waiting for the bus it is effectively busy and an 
error in modeling the bus, which inflates the time the CPU waits for the bus, 
will inflate the CPU utilization in the model. When the utilization is already 
high, this error has a magnified effect on the end to end delay. The bus modeling 
simplifications are: 

— the ISDN chip arbiter shown in Figure 2 was not modeled (probably not 
important) 

— the CPU locks memory in critical sections (not modeled) and this will pre- 
vent higher priority devices from taking the bus; this could influence the 
results at saturation; 

— the bus arbitration time used in the model was the maximum value, and is 
a significant fraction (approaching 10%) of the memory access time which is 
the service time of the bus/memory element at the bottom of Figure 8. If the 
actual arbitration time were less, it would increase the predicted saturation 
load. 

— the priority arbitration between the LAN chip and the ISDN chips was more 
complex than described in Section 3.3, which might have an effect. 

Fig. 10. Bi-directional measurement results for (X to Y) delay, with every B 
channel either 45% or 90% utilized 



Table 1. Summary of End-to-End Delay Errors: Range of system operation for adequate 
prediction accuracy. Higher utilizations gave errors over 10%. 

Experiment Run              Max Active Channels   Max Delay   Max CPU 
                            for Error <= 12%      Error (%)   Utilization 

Outbound1500                120                   9.8         0.512 
Outbound500                 80                    9.6         0.64 
Outbound128                 30                    8.3         0.61 
Inbound1500                 80                    9.3         0.411 
Inbound500                  60                    5.1         0.643 
Outbound1500, Inbound500    60                    9.7         0.56 
Outbound500, Inbound500     50                    9.0         0.59 
Outbound500, Inbound128     40                    9.3         0.588 
Outbound128, Inbound128     40                    11.5        0.80 



In all cases in Table 1, the bottleneck was the main bus. This result is not 
surprising, because every device in the system requires the use of the main 
bus/memory combination. Overall, the model predicts that the inbound traffic 
capacity of the LAN/ER is roughly 45 to 60 channels lower than the outbound 
capacity under the same load. This is attributed to the ISDN chips wasting bus 
time polling empty TX buffers when all the traffic is going the other way. For the 
outbound case the ISDN polling has no significant effect, because when the chips 
are busy transmitting frames they do not poll the buffers as often. 



5 Conclusions 

This paper has described a layered Resource-based Model Architecture (RMA) 
for hardware-software systems, and illustrated its creation and use on a small 
LAN extension router. The layered model shows the dependencies of resources 
on other resources. Through these dependencies the model also shows how con- 
tention effects spread upward from a congested resource, such as the router bus. 

The process of creating the model is systematic and straightforward. Struc- 
ture and traffic parameters were found by tracing scenarios, and demand pa- 
rameters were found by a form of profiling. Of these, the demand parameters 
required the greater effort. 

The measurement effort to find the demand parameters took 3.5 man-weeks 
starting from scratch with just a basic understanding of the system. The vali- 
dation measurements took about 7 man-weeks, but they could be less detailed 
and less time-consuming in a mature process. Sufficient validation data could be 
obtained from routine stress testing, to track the modeling accuracy. Therefore 
the RMA approach combined with an established process of gathering measure- 
ments is a practical way of tracking the performance of the system at any point 
of the development cycle. 

Layered modeling has been used before this for distributed software systems, 
but this is its first application to a system with layered hardware resources. A 
router is typical of systems with significant functionality in hardware, which 
require modeling of the interactions among the hardware components. Layered 
modeling provides a strategic level of detail. It captures the dependencies which 
affect performance, without the high cost of greater detail. 

The model was also used to suggest a new system configuration to relieve 
the bus congestion, and to evaluate the degree of improvement. The model was 
able to identify new resource interactions which made the new capacity less than 
expected. 

Abstract. The architects of today’s distributed applications have a wide 
range of Internet technologies, platforms, and design patterns to choose 
from. In addition to the usual selection criteria of security, portability, 
maintainability, and cost, performance often determines the selection of 
one system architecture over another. This paper presents a quantitative 
technique that can help an architect understand the expected behaviour 
of an application deployed within a target environment. The technique 
automatically finds an object allocation that optimizes a performance 
metric specified by the architect. The technique supports multiple 
classes of requests and mean response time requirements for multiple 
workload conditions. Capacity constraints are also considered. These 
include device utilization limits and the maximum number of 
customers. A deployed application is described using a Layered 
Queuing Model (LQM). Non-linear and linear programming techniques 
are combined with predictive analytic modeling techniques to 
efficiently compare application configuration alternatives. Both non- 
asymptotic (no saturated resources) and asymptotic workload conditions 
are considered. 



1 Introduction 

The architects of today’s distributed applications have many technologies and 
architectural patterns to choose from. Technologies include XML [4] and HTML 
[17], platforms such as Sun’s Java 2 Enterprise Edition (J2EE) [10], Microsoft’s 
Distributed InterNet Architecture (DNA) [7], and the Object Management Group’s 
Common Object Request Broker Architecture (CORBA) [20]. Typically, 
architectures have several logical tiers that separate user interface (browsers) from 
presentation (web servers), application logic (transaction/integration servers), and 
data services (databases). There are network centric architectural patterns with thin 
clients, and others with fat clients. Similarly data access may be abstracted from 
application logic, for example using Enterprise Java Beans [16]. 

There are several degrees of freedom for controlling the performance behaviour of 
such systems. These include: choosing a hardware topology composed of appropriate 



B.R. Haverkort et al. (Eds.): TOOLS 2000, LNCS 1786, pp. 25-39, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 







M. Litoiu and J. Rolia 



processing nodes and network infrastructure, grouping objects into processes, 
allocating the processes to nodes, and choosing the multiplicity of objects and/or 
processes (threading or replica levels) and their activation policies [14]. 

This paper focuses on object and process allocation issues. The techniques are 
applicable to transaction-oriented systems, such as electronic commerce systems, 
where mean performance metrics are of interest. 

Allocation is a well-studied topic in high performance computing. However, the 
focus is primarily on scientific applications instead of transactional systems. Most of 
the techniques assume homogeneous tightly coupled systems and in general do not 
take into account software and hardware queuing delays or variable workload 
conditions. There are two main allocation strategies: 

• dynamic allocation (or load balancing) - application processes (or objects) migrate 
at runtime to the least loaded nodes (or servers). Dispatching decisions can be 
based on metrics such as queue lengths [6], [21] or other system states [12]. 

• static allocation - allocation is done once during the system implementation phase 
and is based on static assumptions about workloads, system parameter 
distributions, and object interactions. Exact and heuristic methods such as 
simulated annealing, tabu search or genetic algorithms are used to minimize the 
mean response time of the application [8]. 

Implementations of platforms such as J2EE and DNA support object and process 
migration behaviour via request redirection mechanisms. For example, a user request 
can be routed to a specific Java servlet server within a pool of servers as part of one 
system configuration or to another server for another configuration. The mechanisms 
for supporting this are offered by the HTTP protocol itself, by cookies and high synergy 
servers that share the state of the user requests. Management systems for these 
platforms can manipulate allocation via the control of (object name, object location) 
pairs within name servers [9]. 

The object allocation approach presented in this paper combines static and dynamic 
allocation strategies. From dynamic allocation we borrow the idea that allocation 
depends on the workload, so many allocations are needed. However, the general goal 
of this work is to find a small set of allocations that cover many key workloads. 
Management systems can treat these allocations as system configurations and shift 
between them as workloads change. Managing a small number of configurations is 
likely to be easier than managing a full dynamic allocation strategy. Appropriate 
allocations/configurations are chosen using static allocation techniques. 

Section 2 considers performance requirements and capacity constraints. It 
introduces a metric called Satisfaction, which is used by an architect to express user 
expectations. Satisfaction aggregates per class mean response times over performance 
requirements expressed as intervals. We optimize this function over all population 
mixes by finding the worst per class response times. At a specific population, there is 
a factorial number of population mixes and evaluating them all is intractable. 
However, by efficiently finding the mix that causes the worst values for metrics 
[1][2][13], we ensure that all other mixes will give better values for the specific 
metric. Section 3 describes algorithms for finding worst case values for the Satisfaction 
metric under capacity constraints. Algorithms for finding configurations that 
maximize Satisfaction are presented in Section 4. Section 5 presents a case study. The 
paper’s conclusions are offered in Section 6. 



2 Performance Requirements and Capacity Constraints 

To begin, a distributed application system under consideration is described as a 
Layered Queuing Model (LQM) [18]. LQMs are extended Queuing Network Models 
(QNM)[11] that include software processes and devices, such as processors and 
network elements, as queues. We consider C customer classes and describe workloads 
using two dimensions: the population level (N) of customers and the population 
mix ($\boldsymbol{\beta}$). N denotes the total number of customers in the system; the population mix 
gives the fraction $\beta_c$ of the total population in each class c, for c = 1 to C. The number 
of customers in class c is $N_c = N \beta_c$. $\mathbf{N} = N \boldsymbol{\beta}$ is a vector that denotes per class 
populations. 
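
The relation between the population level, the population mix and the per class populations can be illustrated with a short sketch. The function name and the validity check on the mix are assumptions made for the example.

```python
def per_class_populations(total_population, mix):
    """Return the per class population vector: total population times each mix fraction.

    mix holds one fraction per class; the fractions are assumed to sum to 1.
    """
    if abs(sum(mix) - 1.0) > 1e-9:
        raise ValueError("population mix fractions must sum to 1")
    return [total_population * fraction for fraction in mix]
```

For instance, a total population of 100 customers with the mix (0.5, 0.25, 0.25) gives per class populations of 50, 25 and 25. Keeping the level and the mix separate lets the same mix be studied at several population levels, or several mixes at one level.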

Distributed applications are subject to many possible workloads, each of which 
may cause different system bottlenecks and therefore require different allocations. In 
this section, we develop expressions for capacity constraints, performance 
requirements, and a satisfaction measure for comparing the behaviour of the 
alternative allocations. 



2.1 Capacity Constraints, Workloads, and Performance Requirements 

The proposed technique distinguishes between population levels N that do or do not 
saturate one or more resources. Resources that can saturate are called bottlenecks. A 
saturated resource has a utilization of 1. If a resource can saturate, we are under 
asymptotic workload conditions. If a population level cannot saturate any resource, we 
are under non-asymptotic conditions. 

For non-asymptotic conditions, an architect specifies utilization limits for 
resources. These are expressed as $l_k < 1$ for $k = 1, \ldots, K$, where K is the number of 
resources. The technique then computes the maximum population level N such that 
these limits are never exceeded. For asymptotic conditions, resources are permitted to 
saturate (i.e., $l_k = 1$ for some resource k). To constrain the study of workloads and 
bottlenecks, an architect specifies a total population level limit $N_{max}$. Utilization and 
population limits are capacity constraints. We refer to an aggregate capacity 
constraint $l$ as the set of utilization limits for a non-asymptotic case and as the total 
population level limit for an asymptotic case. 

When defining capacity constraints, it is not necessary to specify all possible 
population mixes, i.e. many vectors $\boldsymbol{\beta}$. They are too numerous (factorial). The proposed 
analysis technique of Section 3 automatically considers the workload mixes that cause 
worst case behaviour for the given constraints. 

The presented technique also takes into account limits on software utilization in the 
LQM. This is important because not all software components can be replicated; a 
critical section for example cannot be replicated and has a utilization limit of one, 
therefore software queuing delays are inevitable. There are also software components 
whose replication level is restricted by design or implementation constraints. Both 
licensing (financial) and implementation constraints can affect replication policies for 
processes. For example, a target system may only have the license for some number, 
an integer greater than zero, of instances of a database server process subsystem. 
Alternatively, it may not be possible to introduce more than one instance of a database 
into a system without significant changes to application software. In this case, a 
replication limit in the LQM would be 1. In this paper, to keep the presentation 
simple, we do not consider software utilization limits explicitly, but we reflect them in 
the underlying LQM and respect them when choosing alternative allocations. 

We distinguish between the following workload conditions: 

• Light workload: all device utilizations are low, and the queuing delays are very 
small, so the response time of a class is close to the sum of its resource demands; 

• Medium workload: the device utilizations are medium, and the queuing delays at devices 
or software components contribute significantly to response times; as a result the 
response times are sensitive to the total population N and the workload mix $\boldsymbol{\beta}$; 

• High workload: the customer population is such that there is at least one device or 
software entity that is saturated. In this case the response times are quasi-linear 
with respect to the customer population and non-linear with the workload mix $\boldsymbol{\beta}$ 
[2]; 

• Very high workload: the customer population N approaches infinity. 

Capacity constraints can be specified for any of the workload conditions defined 
above. The light condition has been well studied in the high performance computing 
literature and very high workload conditions are to be avoided by design. In this paper 
we focus on workload conditions for medium and high workloads. A study of the very 
high load conditions can be found in [13]. 

For each aggregate capacity constraint $l$, a mean response time requirement is given 
for each class c. It is defined as an interval $[R^{c,l}_{Lo}, R^{c,l}_{Up}]$. These 
values may come directly from requirement documents and/or expected Service Level 
Agreements (SLA), or they may be decided iteratively during the system’s design 
exercise. The lower bound $R^{c,l}_{Lo}$ should be seen as a preferred maximum mean 
response time; the upper bound $R^{c,l}_{Up}$ as a maximum acceptable mean response time. 



2.2 Satisfaction Function 

Response time requirements are specified for many classes of requests. However, to 
compare alternative allocations, a measure of satisfaction of requirements is needed. 
For aggregate capacity constraint $l$, we characterize an allocation A with regard to a 
class c by a function that we call $Satisfaction^{l}_{c}(A)$, which is defined on the set of all 
allocations and takes values in the range [0, 1] such that: 

• It is 1 when the preferred maximum values for response times are fulfilled for all 
population mixes; 

• It is 0 if the acceptable maximum mean response times are not met for some mixes; 

• Otherwise it is a linear function that materializes the following rule: the closer the 
maximum per class response time is to the preferred value, the closer the value is to 1. 



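$$
Satisfaction^{l}_{c}(A) =
\begin{cases}
1 & \text{if } R^{c,l}_{worst} \le R^{c,l}_{Lo} \\
\dfrac{R^{c,l}_{Up} - R^{c,l}_{worst}}{R^{c,l}_{Up} - R^{c,l}_{Lo}} & \text{if } R^{c,l}_{Lo} < R^{c,l}_{worst} \le R^{c,l}_{Up} \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
$$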



We define $R^{c,l}_{worst}$ as the maximum mean response time that is encountered by a 
class c for $l$ over all population mixes. 

$Satisfaction^{l}(A)$ measures the overall Satisfaction per aggregate capacity constraint 
and is defined with respect to the Satisfaction of each class: 

$$Satisfaction^{l}(A) = \min_{c} \, Satisfaction^{l}_{c}(A)$$ 



The allocation process allocates application objects across the hosts with the goal of 
maximizing $Satisfaction^{l}(A)$. Other metrics could be used by the proposed technique 
as well. 
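
The Satisfaction measure is straightforward to compute once the worst case response times are known. The sketch below is one possible reading of the definition: 1 at or below the preferred bound, 0 above the acceptable bound, and linear in between. The function names, the dictionary layout and the class names in the example are hypothetical.

```python
def class_satisfaction(r_worst, r_lo, r_up):
    """Per class Satisfaction for one aggregate capacity constraint.

    r_worst: worst case mean response time over all population mixes
    r_lo:    preferred maximum mean response time
    r_up:    maximum acceptable mean response time (assumed > r_lo)
    """
    if not r_lo < r_up:
        raise ValueError("the preferred bound must be below the acceptable bound")
    if r_worst <= r_lo:
        return 1.0
    if r_worst <= r_up:
        return (r_up - r_worst) / (r_up - r_lo)
    return 0.0


def satisfaction(worst_by_class, bounds_by_class):
    """Overall Satisfaction: the minimum of the per class values."""
    return min(
        class_satisfaction(worst_by_class[c], *bounds_by_class[c])
        for c in worst_by_class
    )
```

For example, with bounds of (2.0, 4.0) seconds for two classes and worst case response times of 3.0 and 2.5 seconds, the per class values are 0.5 and 0.75, so the overall Satisfaction is 0.5. Taking the minimum means an allocation is only as good as its worst served class.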

We call the result of an allocation a configuration; resulting configurations depend, 
among other factors, on the aggregate capacity constraints $l$; in the extreme case, each 
aggregate capacity constraint $l$ may generate several distinct configurations for 
meaningful workload mixes. As stated earlier, our general goal is to identify a small 
set of configurations that adequately support a system’s diverse workloads. 



3 Finding the Maximum Mean Response Times 
for a Given Configuration 

This section describes methods for finding per-class maximum mean response times 
over all population mixes. The techniques for medium and high workload conditions 
are slightly different; the differences are explained. Since the approach is the same 
for any aggregate capacity constraint $l$, $l$ is excluded from the notation. 



3.1 Differences Between Medium and High Workload Conditions 

Medium workload conditions are non-asymptotic conditions such that no resource is 
saturated. For these conditions the behaviour of the system is found by solving its 
corresponding LQM using the Method of Layers [18]. Queuing at both software and 
hardware resources is taken into account. To find the maximum mean response time 
of a class c, a mathematical programming model is needed. Mathematical 
programming offers optimization methods that deal with the problem of finding the 
minimum or maximum of an objective function in the presence of equality and 
inequality constraints. If the objective function and constraints are linear, we have a 
linear program; otherwise, we have a non-linear program. Medium workload 
conditions are related to non-linear programs. For high workload conditions, response 
times are quasi-linear with respect to the total customer population. For these systems, 
linear programming methods are used. 

Linear and non-linear programming methods have been studied rigorously for over 
50 years. There are mature algorithms that handle many thousands of variables, which 
are adequate for even large distributed system models. 



3.2 The Programming Model 

The average response time for a request of class c for any capacity constraint, $R^c$, is 
given in equation (2). 

$$R^c(N) = \sum_{i=1}^{K} \left( D^c_i + W^c_i(N) \right) + \sum_{p \in O} R^c_p(N) \qquad (2)$$



where: 

• $K$ is the number of devices; 

• $D^c_i$ is the mean demand at device i by a class c request; 

• $W^c_i$ is the mean waiting time at device i for a class c request; 

• $O$ is the set of objects visited synchronously by objects within the class request; 
and 

• $R^c_p$ is the mean waiting time at object p including service time for class c requests. 



The parameters of equation (2) are the parameters for a system’s corresponding LQM 
and the results of performance evaluation. 

• $K$, $D^c_i$, $O$ are the input parameters of the LQM. Additionally, the system’s LQM 
requires: application configuration parameters (the number of replicates and 
threading level of each process); workload conditions (per-class think times and 
population vector $N = (N_1, N_2, \ldots, N_C)$, where $N_c$ is the class c customer population); 
and execution environment policies (device scheduling disciplines). 

• $W^c_i$, $R^c$, $R^c_p$ are outputs of a performance evaluation of the LQM. Additional output 
values include per-class process, object, and device utilizations and mean queue 
lengths. 
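
As a small illustration of equation (2), the following sketch adds up the device demands, the device waiting times, and the waiting times at synchronously visited objects for a single class. The function name and the numeric values are hypothetical; in practice the waiting times would be outputs of an LQM solution rather than hand-picked numbers.

```python
def class_response_time(demands, device_waits, object_waits):
    """Equation (2) for one class: sum_i (D_i + W_i) + sum_p R_p.

    demands[i]      -- mean demand at device i (ms)
    device_waits[i] -- mean waiting time at device i (ms)
    object_waits[p] -- mean waiting time (incl. service) at object p (ms)
    """
    if len(demands) != len(device_waits):
        raise ValueError("expected one demand and one waiting time per device")
    device_part = sum(d + w for d, w in zip(demands, device_waits))
    return device_part + sum(object_waits)


# Hypothetical values for two devices and two synchronously visited objects.
response_time = class_response_time([20.0, 35.0], [5.0, 12.5], [40.0, 15.0])
```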



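The problem of finding the maximum $R^c$ in the presence of utilization constraints $I_k$ can 
be expressed as a mathematical programming problem as follows: 

$$\begin{aligned}
\text{Max } & R^c \\
\text{subject to } & U_k = \sum_{c=1}^{C} U_{kc} \le I_k, \\
& U_{kc} \ge 0, \quad \forall k \in \mathcal{K},\ \forall c \in \mathcal{C}
\end{aligned} \qquad (3)$$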



where $\mathcal{C}$ and $\mathcal{K}$ are the sets of classes and devices (with cardinalities C and K, 
respectively), $U_k$ is the total utilization of device k, $U_{kc}$ is the utilization of device 
k by requests of class c, and $I_k$ is the maximum permitted utilization of device k. 

By using device r as reference [13], we can rewrite equation (3) as: 

$$\begin{aligned}
\text{Max } & R^c \\
\text{subject to } & U_k = \sum_{c=1}^{C} \frac{D_{kc}}{D_{rc}}\, U_{rc} \le I_k, \quad \forall k \in \mathcal{K}, \\
& U_{r1}, U_{r2}, \ldots, U_{rC} \ge 0
\end{aligned} \qquad (4)$$



The constraints of equation (4) generate a feasible space [3]. The points in the 
feasible space where two or more constraints reach their limits ($I_k$) are called extreme 
points. At extreme points, one or more devices reach their utilization limits (in the 
asymptotic case, they saturate). These points give workload mixes that must be 
studied to ensure that worst-case response times are considered. However, in general, 
there are a combinatorial number of such points; mathematical programming methods 
are used to limit the number of points considered. 

To ensure a unique solution for equation (4), mathematical programming methods 
require that $R^c$ be continuous and convex with respect to the decision variables 
$U_r = (U_{r1}, U_{r2}, \ldots, U_{rC})$. Also, the decision variables should be continuous. We ensure the 
continuity of the decision variables and therefore of $R^c$ by considering the class 
populations as real numbers. The convexity of the response time is ensured by the 
LQM and associated mean value analysis solution methods. Note that if $R^c$ is non-linear, 
then the solution of equation (4) is on the boundary of the feasible space given 
by the configuration constraints. If $R^c$ is linear then the maximum is at one of the 
extreme points. A good approximation of the maximum may be given by considering 
$R^c$ linear and searching for the maximum among the extreme points of the feasible space. 
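
The idea of searching the extreme points under a linear approximation can be sketched for a toy case with two classes and two devices. Everything numeric below is hypothetical: the demand matrix, the utilization limits, and especially the gradient `g`, which stands in for a locally linearised response time (in the paper those sensitivities would come from LQM solutions). The brute-force enumeration is only for illustration; the simplex method avoids visiting every vertex.

```python
from itertools import combinations


def extreme_points(A, b, eps=1e-9):
    """Vertices of {u >= 0 : A u <= b} for exactly two decision variables."""
    # Non-negativity is expressed as two extra constraints -u_c <= 0.
    rows = [tuple(r) for r in A] + [(-1.0, 0.0), (0.0, -1.0)]
    rhs = list(b) + [0.0, 0.0]
    points = []
    for i, j in combinations(range(len(rows)), 2):
        (a1, a2), (c1, c2) = rows[i], rows[j]
        det = a1 * c2 - a2 * c1
        if abs(det) < eps:
            continue  # parallel boundaries do not meet in a single point
        # Cramer's rule for the intersection of the two boundaries.
        u1 = (rhs[i] * c2 - a2 * rhs[j]) / det
        u2 = (a1 * rhs[j] - rhs[i] * c1) / det
        feasible = all(r[0] * u1 + r[1] * u2 <= h + eps for r, h in zip(rows, rhs))
        duplicate = any(abs(u1 - p[0]) < eps and abs(u2 - p[1]) < eps for p in points)
        if feasible and not duplicate:
            points.append((u1, u2))
    return points


# Hypothetical demands D[k][c] in ms; device 0 is the reference device r.
D = [[10.0, 20.0], [30.0, 10.0]]
limits = [0.8, 0.9]  # hypothetical utilization limits I_k
# Constraint matrix of equation (4): A[k][c] = D[k][c] / D[r][c].
A = [[D[k][c] / D[0][c] for c in range(2)] for k in range(2)]

points = extreme_points(A, limits)
g = (100.0, 50.0)  # hypothetical gradient of a linearised response time
best = max(points, key=lambda p: g[0] * p[0] + g[1] * p[1])
```

With these numbers the feasible space has four vertices, and the linearised objective is largest where both device constraints are tight.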

The next section illustrates the use of the linear convex simplex method [3] to 
navigate through feasible space and the MOL [18] to solve the LQM to get estimates 
for $R^c$. 

3.2.1 Implementing the Programming Model 

When solving equation (4), we assume that the think times associated with each 
request class are fixed. The following algorithm (Fig. 1) is used to determine 
maximum values for $R^c$ and the population vector $N = (N_1, N_2, \ldots, N_C)$ under which 
the maximum occurs. 

For Step 1, we choose to set the per-class device utilizations to zero. 

Step 2 is an application of the linear simplex algorithm, which means that we 
approximate the solution of equation (4). It finds the search direction in utilization 
space by establishing a new goal for $U_r$, namely $U^*_r$. The search path is based on 
pivoting using natural and slack utilization variables by computing numeric estimates 
for the partial derivatives from the results of the MOL [14]. Once a search direction is 
found, the new value for $U_r$ is computed by finding the extreme point in that direction. 

Step 3 depends on whether the aggregated capacity constraints are utilization 
(medium workload) or population level (high workload) constraints. 

R^c Algorithm 

Input: LQM, configuration and capacity constraints 
BEGIN 
    FOR c = 1 TO C 
    BEGIN 
        1. Choose an initial point in the feasible utilization space that gives an 
           initial U_r 
        2. Using the simplex algorithm, choose a new value U*_r to increase R^c 
        3. Repeatedly solve the system’s LQM using the MOL to get an N that 
           causes U*_r and provides a new estimate for R^c 
        4. Repeat steps 2 to 3 until U_r cannot change without decreasing R^c 
    END 
    RETURN {R^c, c = 1..C; U*_r, N} 
END 



Fig. 1. Algorithm for finding maximum per class response time for a given configuration. 



Step 3 for Medium Workloads. For utilization constraints, step 3 iteratively applies 
the MOL using a hill climbing algorithm; it searches for the customer population vector N 
that achieves $U^*_r$. From an initial population at the 0-th iteration, $N^0 = (N^0_1, N^0_2, \ldots, N^0_C)$, 
for example the zero vector, we search toward the target point $U^*_r$ with the 
throughput vector $X^* = (U^*_{r1}/D_{r1}, \ldots, U^*_{rC}/D_{rC})$ by varying the class populations 
and solving the LQM. With each solution of the LQM, we have new estimates for N 
and the corresponding $R^c$. 
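
The target throughput vector used in this step follows directly from the utilization law. The sketch below converts a target utilization vector at the reference device into per-class throughputs and then checks the utilization those throughputs imply at every device. The function names and numbers are hypothetical, and the sketch deliberately leaves out the hill climbing itself, which needs an LQM solver to map populations to throughputs.

```python
def target_throughputs(u_ref, d_ref):
    """X_c = U_rc / D_rc for each class c (utilization law at the reference device)."""
    return [u / d for u, d in zip(u_ref, d_ref)]


def device_utilizations(throughputs, demands):
    """U_k = sum_c X_c * D_kc for each device k."""
    return [sum(x * d for x, d in zip(throughputs, row)) for row in demands]


# Hypothetical demands D[k][c] in ms; device 0 is the reference device r.
D = [[10.0, 20.0], [30.0, 10.0]]
u_star = [0.2, 0.6]                    # hypothetical target U*_r per class
X = target_throughputs(u_star, D[0])   # requests per ms
U = device_utilizations(X, D)          # total utilization per device
```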

Step 3 Under High Workloads. Under asymptotic conditions, a point in device 
utilization space may be obtained from many mixes of class populations and for many 
total populations. The counterparts of extreme points in utilization space are 
saturation sectors in the population space. The saturation sectors are delineated by 
crossover points, as considered in Asymptotic Bound Analysis (ABA) [1][2]. A 
saturation sector is the set of the workload mixes that yield an extreme point. By 
definition, within a saturation sector, per class utilizations of the corresponding 
devices are constant. Additionally, at crossover points the population mix equals the 
corresponding device utilization. The number of crossover points of a saturation 
sector equals the number of devices saturated in the corresponding extreme point. 
These can be computed as follows: if v denotes a crossover point, then the ratio of the 
class c population over the total population N is $P^v_c = N_c/N = U_{ic}$, where i is the device 
saturated in that crossover point [2]. 

Let N be the population level capacity constraint, i.e. the maximum number of 
customers allowed in the system. Then, for each crossover point v, the properties of 
ABA, for product form queueing networks, prove that [2]: 

c = 1...C     (5) 

We note that LQMs are not product form queueing networks; however, the 
property provides a useful approximation. 

Step 3 finds the search direction by computing the derivatives of the objective 
function with respect to population mixes at the current crossover point. Once a 
search direction is found, the next point in the utilization space is found by directly 
applying the simplex methods. 



4 Finding Configurations that Maximize Satisfaction 

Fig. 2 gives a high level description of the allocation algorithm that finds a 
configuration to maximize satisfaction. The procedure is common for both medium 
and heavy population levels and uses the Algorithm of Fig. 1 for finding the worst 
per class response time for a given configuration. The algorithm is illustrated further 
by example in the case study of Section 5. 

Step 1 gets an initial configuration. Step 2 repeatedly changes the configuration to 
increase the Satisfaction of requirements. Step 2.1 computes the worst response time 
over all population mixes as shown in the previous section. Note that the R^c Algorithm 
returns both $R^c$ and $U^*_r$. Step 2.2 determines a new configuration by re-allocating 
moveable objects based on the data collected at Step 2.1. We note that in general 
allocation problems are NP-complete, so heuristics are needed to decide new 
configurations. In the next section, we present the result of a case study using a simple 
greedy allocation heuristic. 



Allocation Algorithm 

Input: LQM, performance requirements, hardware topology, current configuration, 
design constraints 

BEGIN 
    1. Get an initial configuration 
    2. REPEAT 
        2.1 For each c, find the worst R^c by the R^c Algorithm and compute the 
            Satisfaction 
        2.2 Generate a new configuration, based on the data collected at 2.1 
       UNTIL Satisfaction is 1 or it cannot be improved 
    RETURN {Configuration, Satisfaction} 
END 



Fig. 2. Algorithm for finding the configuration that maximizes Satisfaction. 



5 Case Study - A Multi Tier Distributed Application 

To illustrate the above algorithms, we consider a multi tier application. It has 5 classes 
of request, 5 nodes (excluding the client nodes), 10 server processes and 24 objects. 
To keep the presentation simple, each server process is a container for objects and 
there is no distinction between the web, application or integration servers. A hardware 
layout and an initial configuration are shown in Fig. 3. CPU and Disk service rates for 
the nodes are presented in Table 1 and object interactions (remote method 
invocations) are shown in Fig. 4. Though interactions are the same for each class, the 
resource demands differ for each class. 

Except for the objects O13 to O24, which can migrate between nodes, all other 
objects are pre-assigned, as shown in Fig. 3. This figure shows the initial allocation of 
the objects, the start configuration for our search. This configuration obeys the 
“locality principle” [19] [15] and places most of the objects that interact with each 
other in the same container, Server 1. To keep the presentation simple, we assume that 
we can assign any object that can migrate to any container. The think time for all 
classes of request is specified as 10 s. 

The application should satisfy the following performance requirements at two 
population levels (N = 500 and N = 1000). Both are considered high workload conditions 
because devices are permitted to saturate: 

(a) N = 500: $[R^0_{lo}, R^0_{up}] = [0, 30000]$; $[R^1_{lo}, R^1_{up}] = [0, 10000]$; 
$[R^2_{lo}, R^2_{up}] = [0, 40000]$; $[R^3_{lo}, R^3_{up}] = [R^4_{lo}, R^4_{up}] = [0, 50000]$ ms. 

(b) N = 1000: $[R^0_{lo}, R^0_{up}] = [R^1_{lo}, R^1_{up}] = \cdots = [R^4_{lo}, R^4_{up}] = [0, 50000]$ ms. 




Fig. 3. The initial object allocation. 

Table 1. CPU and DISK service rates. 

        Client Node   Node 1   Node 2   Node 3   Node 4   Node 5 
CPU     1             0.1      0.1      0.1      0.1      0.1 
DISK    1             0.1      0.1      0.1      0.1      0.1 



The Allocation Algorithm computes the worst case response time for each class 
using the R^c Algorithm. Fig. 5 shows successive allocations/configurations for 
N = 500; the initial configuration has index 1. Computed worst case response times 
(over all workload mixes) are between 3000 and 5000 ms. For example, the worst 
case response time of Class 1 is 4200 ms. The Satisfactions associated with the 
response times and allocations are shown in Fig. 7. Although the response time of 
Class 1 is not the highest of all per-class mean response times, the Satisfaction of 
Class 1 is the minimum (0.59) due to its strict performance requirements. As a result, 
the Allocation Algorithm tries to improve the Satisfaction of Class 1. 







Per-class average object CPU demands are uniformly distributed between [10, 200] ms 
Per-class average object DISK number of visits are uniformly distributed between [1, 8] visits 



Fig. 4. Object interaction. 



To improve the satisfaction of Class 1, our greedy heuristic finds the host that 
affects Class 1 response time most (Node 1 in this case), finds the object with the 
maximum demand by Class 1 (O20) on the host, and then moves it to the least utilized 
host (Node 5). This gives configuration 2. Next, the Allocation Algorithm starts a 
new iteration. 
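
One step of such a greedy heuristic might look like the sketch below. It is not the authors' implementation: the data structures and names are hypothetical, and the host that "affects the class most" is approximated here by the host carrying the largest total demand of that class, which is an assumption made only for this illustration.

```python
def greedy_move(placement, demand, utilization, movable, cls):
    """Return a new placement after one greedy re-allocation for class `cls`.

    placement[obj]   -- host currently holding the object
    demand[obj][cls] -- mean demand of the class at the object (ms)
    utilization[h]   -- current utilization of host h
    movable          -- set of objects that are allowed to migrate
    """
    # Assumption: rank hosts by the total demand the class places on them.
    load = {}
    for obj, host in placement.items():
        load[host] = load.get(host, 0.0) + demand[obj][cls]
    worst_host = max(load, key=load.get)

    candidates = [o for o in movable if placement[o] == worst_host]
    if not candidates:
        return dict(placement)  # nothing can be moved off that host

    heaviest = max(candidates, key=lambda o: demand[o][cls])
    target = min(utilization, key=utilization.get)  # least utilized host

    new_placement = dict(placement)
    new_placement[heaviest] = target
    return new_placement


# Hypothetical three-object excerpt of a configuration.
placement = {"O19": "Node 1", "O20": "Node 1", "O21": "Node 2"}
demand = {"O19": {1: 50.0}, "O20": {1: 180.0}, "O21": {1: 90.0}}
utilization = {"Node 1": 0.9, "Node 2": 0.5, "Node 5": 0.1}
new_placement = greedy_move(placement, demand, utilization, {"O19", "O20", "O21"}, cls=1)
```

In the full algorithm each such move would be followed by a fresh worst-case response time computation, since a move that helps one class can hurt another.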

For each of the performance requirements, (a) and (b), the Allocation 
Algorithm generates 10 configurations before satisfaction cannot be improved. Fig. 5 
and Fig. 6 show the evolution of per-class response times for N = 500 and N = 1000, 
respectively. The configuration that maximizes the satisfactions (in this particular 
case, the same for both population levels) is presented in Table 2. Changes to 
Satisfaction for the N = 500 case are presented in Fig. 7. 

Table 2. The configuration that maximizes the satisfaction. 

Object   O13    O14    O15     O16    O17    O18     O19    O20     O21    O22    O23    O24 
Server   Sv. 4  Sv. 9  Sv. 10  Sv. 1  Sv. 1  Sv. 10  Sv. 1  Sv. 10  Sv. 8  Sv. 9  Sv. 1  Sv. 8 




Fig. 5. Maximum mean response time versus configurations, N=500. 
(Horizontal axis: configurations 1 to 10; one curve per class, Class 0 to Class 4.) 



Fig. 6. Maximum asymptotic mean response time versus configurations, N=1000. 
(One curve per class, Class 0 to Class 4.) 



We draw several conclusions from this case study: 

• The response times of each class of request evolve similarly as objects move from 
one server to another; this is because the structure of their requests is similar and 
object demands are uniformly distributed in a narrow interval. In general, classes of 
requests will use different objects and be affected differently by configuration. 

• Qualitative heuristic strategies for object allocation can be misleading; clustering 
objects by the “locality principle” alone, as used for our initial configuration, does 
not yield good performance. 

• A good allocation can substantially reduce the mean response time of a distributed 
system as a whole and of each class of requests. For example, the 
response times of each class were reduced by approximately 80% with respect to the 
initial configuration. 

• Response times are affected a great deal by population mix. Fig. 8 illustrates per-class 
changes in response time for several extreme population mixes for the final 
configuration with N = 500. Depending on the mix, response times vary by 
almost 50%. This illustrates why workload mix must be taken into account when 
comparing allocation alternatives. 




Fig. 7. Per class Satisfaction, N=500. 



Fig. 8. The change in response time with population mixes. 
(Horizontal axis: class, 1 to 5; one bar series per population mix: N=(496 1 1 1 1), 
N=(1 496 1 1 1), N=(1 1 496 1 1), N=(1 1 1 496 1), N=(1 1 1 1 496).) 



6 Summary and Conclusions 

The paper presents an object allocation technique for multi-class distributed systems. 
The technique uses performance requirements defined for each class of request and at 
different population levels. An aggregation function is defined on the performance 
requirements to give a measure of how well the system’s requirements are satisfied as 
a whole. An allocation algorithm is given that finds the maximum per-class response 
time for each class for a configuration over all workload mixes and then changes the 
configuration to improve mean response times with respect to the requirements. 

The algorithm for finding maximum per class response time relies on non-linear 
and linear-programming models. Consequently, it can be generalized for any metric 
that satisfies the properties of convexity. Similar techniques can be developed for 
throughputs, object utilization, or any convex combination of these metrics. 

The proposed techniques can be used to help decide object allocation strategies in 
response to changing workload conditions. The complexity of the algorithms is 
appropriate for the scope of their application, off-line support of system design. For 
example, the linear simplex method is practically polynomial, proportional to the 
number of device constraints [5]. For the case study, solution times were on the order 
of 5 minutes for the two cases together. 



Abstract. Traffic characterization and modeling has been an extensive 
area of research in the last few years. Many of these studies aim at 
constructing accurate models to predict network performance. Performance 
studies include analysis of admission control algorithms, buffer 
dimensioning, and many others. Several steps are needed to conduct a 
performance study. First, it is necessary to characterize the traffic 
generated by the applications; second, it is important to choose an appropriate 
model to represent this traffic. The analysis of the accuracy of a traffic 
model is, in general, based on the match of some descriptors and on how 
well it predicts the performance measures. Finally, the user would like 
to construct and solve a network performance model. A large number 
of models have been proposed in the literature to describe a variety of 
traffic generated by data, audio and video sources. Which model is 
the most accurate for each type of traffic is still an open issue in the 
literature, and thus it is important to provide an environment to aid 
the user in the development and analysis of traffic models. The focus of 
this study is two-fold: to obtain analytical expressions for some important 
traffic descriptors calculated from general Markovian models and to 
present a set of modules we have implemented to provide an environment 
useful for traffic modeling, analysis and experimentation. These modules 
are currently being integrated in the TANGRAM-II modeling tool. 



1 Introduction 

The modeling and analysis of computer network traffic has been an area of 
extensive research over the last ten years, as new multimedia applications over 
the Internet become commonplace. One of the main goals of teletraffic 
engineering is to develop models that predict, with 
sufficient accuracy, the impact of the traffic generated by applications on the 
network resources, in order to provide the necessary quality of service (QoS) to 
the end users. Performance studies include determining buffer behavior, studying 
congestion control policies and admission control algorithms, and many others. 

In order to conduct a performance study, several steps are needed. First one 
must understand the characteristics of the traffic competing for the resources 
under investigation. 

One of the challenges of traffic characterization is to obtain a concise descrip- 
tion of complex traffic flows. Simple descriptors such as the mean rate and peak- 
to-mean ratio have been widely used, but may not be sufficient to accurately 
represent the burstiness characteristics of the traffic. Several other descriptors 
have been proposed such as the peakedness coefficient, the autocovariance, the 
index of dispersion over an observation time, and the Hurst parameter. They all 
try to capture correlations in the traffic flows. The issue here is to be able to 
determine the main parameters and correlation structures that have the biggest 
impact on the performance measures of interest. 

Another important step in a performance study is to choose a proper model 
to represent the traffic under consideration. Traffic models are analyzed based 
on: (a) how closely the traffic descriptor values obtained from the model match 
those obtained from real traffic traces; (b) how well a resource model, 
fed by a chosen traffic model, predicts the performance measures under study. 

A large number of models have been proposed in the literature. They include 
Markovian models, such as simple on-off models, Markov modulated Poisson 
processes, and Markov modulated fluid models, and regression models such as 
Transform-Expand-Sample (TES) models [6,7] and Autoregressive Moving 
Average (ARMA) models [1]. These models have the property that the 
autocorrelation function decays exponentially. However, it has been noticed in a 
number of studies that, for many different types of traffic, the autocorrelation 
decays at a slower rate than exponential. Such traffic is said to have 
long-range dependence. The fractional autoregressive integrated moving average 
(FARIMA) process and the Fractional Brownian Motion (FBM) process 
are examples of long-range dependent processes. 

Although they do not possess the long-range dependence property, Markov models 
are still attractive for several reasons. First, they are mathematically 
tractable. Second, it has been shown that long-range correlations can be 
approximately obtained from certain kinds of Markovian models (e.g. [11]). These 
are basically obtained by superposing on-off sources where the on and off periods 
have high variability [12]. Third, works such as [13] indicate that long-range 
dependence is not a crucial property for some performance measures and Markov 
models can be used to accurately predict performance metrics. 

The paragraphs above outline the importance of providing the modeler with 
a set of tools to perform a range of experiments, including the ability to 
measure traffic, obtain descriptors, and experiment with different models. One would 
like the modeler to be able to: (a) collect statistics from real traces; (b) choose 
from different traffic models (which include Markovian, FBM and FARIMA 
models); (c) calculate descriptors from the models to be able to match 
parameters and/or verify statistical differences between the model and the measured data; 
(d) create a “complete” performance model which includes the traffic model and 
the resources under study; (e) solve it via simulation or analysis; and (f) possibly 
conduct experiments with traffic generators over a controlled laboratory 
environment. 

Since Markovian models still play an important role in traffic engineering, 
one of the objectives of this paper is to obtain analytical expressions for some 
important traffic descriptors calculated from general Markovian models. It is 
not always an easy task to obtain analytical expressions, especially for descriptors 
based on second-order statistics of general Markovian models. In the literature, 
expressions for descriptors such as the index of dispersion and the autocovariance 
are only found for simple Markovian models, such as MMPP models with 
2 states. For instance, an expression has been obtained for the autocovariance 
function of an MMPP. The autocovariance is shown to be a function of the 
exponential of the rate matrix that modulates the arrival process. However, analytical 
expressions that are simple to compute for generic Markovian models are not 
found, though they are important to facilitate the modeling task. 

The other goal of this paper is to present a set of modules we have 
implemented to provide an environment useful for traffic modeling, analysis and 
experimentation. One tool that has goals similar to ours is SMAQ. SMAQ 
works with one type of Markovian model called the circulant modulated Poisson 
process (CMPP). Once statistics are collected from a trace, a CMPP model is 
automatically constructed by matching first and second order statistics from the 
traces. The resulting traffic model can either be used to obtain a trace or in 
an overall queueing model that is solved analytically. 

The tool we developed is not tailored to a specific model. Our concern is to 
provide the user with a range of modules to be able to experiment with different 
models and try different solution approaches. The modules developed include the 
following: (a) modules to obtain descriptors from traces; (b) modules to calculate 
descriptors analytically from a Markovian model; (c) traffic generators; (d) a tool 
to combine a traffic model with resource models, and feed the overall model to 
a simulator or solve it analytically. The set of modules is being incorporated 
in the TANGRAM-II modeling environment. 

In Section 2 we introduce some background material on traffic descriptors 
useful in traffic engineering. Section 3 shows how to obtain analytical expressions 
for a set of the descriptors mentioned in Section 2 for general Markovian models. 
We describe the modules we have developed for traffic characterization and 
experimentation in Section 4. In Section 5 a few examples illustrate the usefulness 
of our environment, and the final section summarizes our results. 



2 Traffic Characterization 



As mentioned in the previous section, traffic descriptors try to capture the main 
statistical characteristics of the traffic being transported by the network. They 
are useful to assess the demands for network resources. 

Let $\mathcal{A}(\tau) = \{A(s, s+\tau), s > 0\}$ be the random process where $A(s, s+\tau)$ 
is equal to the amount of traffic generated in the interval $(s, s+\tau)$. One 
way to describe a traffic stream is through the knowledge of the distribution of 
$A(s, s+\tau)$ for several time scales $\tau$. Clearly this detailed information is of little 
practical use for implementing, for instance, control policies, and the issue is to 
determine a concise set of descriptors that are easy to obtain and yet provide an 
accurate estimate of the performance metrics of interest, such as buffer overflow 
probabilities. 

The objective of this section is to briefly survey a few descriptors that have 
been proposed in the literature. We are not interested in discussing the usefulness 
of each descriptor. This is still a subject of debate in the literature. Our goal is 
only to provide some background material for the remainder of this paper. 

The simplest traffic descriptors are: average traffic rate ($\lim_{\tau \to \infty} A(0, \tau)/\tau$), 
peak rate ($\max_s \{A(s, s+\tau)/\tau\}$) for a given time scale $\tau$ (for instance the duration 
of a frame) and the burstiness (peak to average rate). Although they are useful 
traffic descriptors, they do not provide sufficient information about complex 
traffic rate correlations in different time scales, and more complex descriptors 
have been proposed in the literature. 

One descriptor that captures traffic correlations is the autocovariance. 
Consider a stationary stochastic process $X = \{X(t) : t \ge 0\}$ with mean $\mu$ and finite 
variance $\sigma^2$. The autocovariance function of X is defined as: 

$$\mathrm{Cov}(\tau) = E[X(t)X(t+\tau)] - \mu^2 = E[X(0)X(\tau)] - \mu^2 \qquad (1)$$

for lag $\tau > 0$. Process $X(t)$ could be, for instance, the traffic rate at time t (and 
the time scale could be the duration of a frame). 

Similarly, the autocorrelation function is defined by: 

$$\rho(\tau) = \frac{\mathrm{Cov}(\tau)}{\sigma^2} \qquad (2)$$



It is interesting to note that, for models that exhibit short-range dependence, 
the autocorrelation function has the form $c\,e^{-\beta\tau}$ (where c is a constant and 
$\beta > 0$), that is, it decays exponentially fast for large lags $\tau$. On the other hand, 
models that exhibit long-range dependence have an autocorrelation function of 
the form $c\,\tau^{-\beta} = c\,\tau^{2H-2}$ for some $1/2 < H < 1$, where H is called 
the Hurst parameter [3]. 
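
When only a measured trace is available, equations (1) and (2) are usually replaced by sample estimates. The sketch below uses the common biased estimator (dividing by the trace length); the function names and the toy trace are illustrative assumptions, not part of the tool described in this paper.

```python
def autocovariance(x, lag):
    """Biased sample estimate of Cov(lag) for the samples x[0], ..., x[n-1]."""
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n


def autocorrelation(x, lag):
    """Sample estimate of rho(lag) = Cov(lag) / Cov(0)."""
    return autocovariance(x, lag) / autocovariance(x, 0)


# Toy trace: a rate that alternates between 1 and 0 in every sampling slot.
trace = [1.0, 0.0] * 50
rho0 = autocorrelation(trace, 0)
rho1 = autocorrelation(trace, 1)   # strongly negative for an alternating trace
```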

Let $\mathcal{N} = \{N(t), t \ge 0\}$ be a stationary process that counts the amount of 
traffic (packets, bits, etc.) that is transmitted in an interval of length t. The 
index of dispersion for counts captures the variability of the amount of traffic 
transmitted along a time interval. It is defined as [18,21]: 

$$\mathrm{IDC}(t) = \frac{\mathrm{Var}\{N(t)\}}{E\{N(t)\}} \qquad (3)$$



It can be shown that the IDC, for processes that exhibit long-range dependence, 
is monotonically increasing with t. That is, if we plot log IDC(t) against log t 
we obtain an asymptotic straight line with slope 2H - 1. 

The descriptors above are commonly used for traffic engineering. Other 
descriptors also exist, such as the index of dispersion for intervals (e.g. [18]) and 
the peakedness [22]. 
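
For a measured trace, the IDC of equation (3) can be estimated by summing per-slot counts over non-overlapping windows and taking the ratio of the sample variance to the sample mean. The sketch below is one simple way to do this; the function name, the windowing scheme, and the toy trace are assumptions made for illustration.

```python
def idc(counts, window):
    """Estimate IDC at a time scale of `window` slots from per-slot counts."""
    if window < 1 or window > len(counts):
        raise ValueError("window must be between 1 and len(counts)")
    # Aggregate the trace into non-overlapping windows (a trailing partial
    # window is dropped).
    blocks = [sum(counts[i:i + window])
              for i in range(0, len(counts) - window + 1, window)]
    mean = sum(blocks) / len(blocks)
    variance = sum((b - mean) ** 2 for b in blocks) / len(blocks)
    return variance / mean


smooth = [3] * 20            # constant-rate source
bursty = [10, 10, 0, 0] * 5  # on-off pattern with period 4 slots
idc_smooth = idc(smooth, 2)
idc_bursty_1 = idc(bursty, 1)
idc_bursty_2 = idc(bursty, 2)
```

The constant trace has no variability at any time scale, whereas the on-off pattern shows an IDC that grows with the window until the window covers a full period.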

Let us consider a traffic “burst period” defined when the traffic intensity 
is above a given level, for instance when a source is pumping traffic above the 
channel capacity. Intuitively, the duration of a burst can affect the utilization 
of network resources. For a given amount of traffic transmitted in an interval it 
may be of interest to study the effect of “long” versus “small” bursts. From usual 
availability measures we can easily obtain measures related to the duration of a 
burst. Let BD(t) be the random variable equal to the duration of a burst during 
an interval of length t. The expected value of BD(t) or even its distribution can 
be easily obtained from Markovian models. Another measure we can compute is 
the fraction of time a source is in “burst periods” for a given time interval t. 



3 Descriptors Computation for Markov Chains 

In this section we show how the descriptors defined in the previous section can 
be obtained from a (general) Markov chain model. The calculation of some 
descriptors requires the use of transient analysis over a given time interval, and we 
employ the Uniformization technique for that. 

Consider a homogeneous continuous time Markov chain $\mathcal{M} = \{M(t) : t \ge 0\}$ 
with infinitesimal generator Q and discrete state space S. Let P be 
the discrete time transition probability matrix obtained after uniformizing $\mathcal{M}$ 
with rate $\Lambda$. Let us assume that each state is associated with a reward, $\lambda_i$ being the reward 
rate of state i, and let $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_N)$ be the reward vector, where N is the 
cardinality of S. For the traffic models we consider, the reward $\lambda_i$ represents the 
constant rate at which units of data (bits, cells, packets, etc.) are transmitted 
while in state i. 

It is clearly trivial to calculate the average traffic rate, burstiness, etc, from 
the steady state solution of our Markov reward model. As mentioned in the 
previous section, we can also easily calculate measures related to the duration 
of a burst, and so we omit these calculations. In this section we are interested 
in the autocovariance and the IDC(t). 



Autocovariance. The autocovariance is obtained from equation (1), where $X(t)$ 
is the traffic source rate at time t. Uniformizing the Markov chain that models 
a traffic source, we have that $E[X(t)X(t+\tau)]$ is given by: 

$$E[X(t)X(t+\tau)] = \sum_{m=0}^{\infty} \sum_{n \ge m} E[X(m)X(n)]\, e^{-\Lambda t}\frac{(\Lambda t)^{m}}{m!}\, e^{-\Lambda \tau}\frac{(\Lambda \tau)^{n-m}}{(n-m)!} \qquad (4)$$

where 

$$E[X(m)X(n)] = \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j P[X(m) = i, X(n) = j] = \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \pi_i \lambda_j\, p_{ij}(n-m) \qquad (5)$$

with $\pi_i$ the stationary probability of state i, 

and $p_{ij}(n)$ is the n-step transition probability from i to j. The last equality is 
true since the Markov chain is assumed to be homogeneous. 

We define 

$$\gamma(n) = \lambda\,(P(n))^{T}, \qquad (6)$$

where $A^{T}$ is the transpose of matrix A and $P(n)$ is the matrix where the (i,j)-th 
element is equal to $p_{ij}(n)$. Using this definition in (5) and the resulting equation 
in (4), and considering that the chain is homogeneous and stationary, we obtain: 

$$E[X(t)X(t+\tau)] = \sum_{i=1}^{N} \lambda_i \pi_i \sum_{n=0}^{\infty} \gamma_i(n)\, e^{-\Lambda\tau}\frac{(\Lambda\tau)^{n}}{n!} \qquad (7)$$

where $\gamma_i(n)$ is the i-th element of vector $\gamma(n)$. 

Given two non-negative vectors x = (cci , . . . ,xn) and y = (j/i, . . . , pn), the 
internal product is given by x • y = Equation ® can then be written 

in the form: 

E[X{t)X{t + T)] = ■ S{t) (8) 

where 

Mt-I” 

^(r) = ^7(n)e-^"^^. (9) 

n\ 

n=0 

Note that, in equation (9), the vector $\gamma(n)$ can be recursively computed by

$$\gamma(n) = \gamma(n-1) P^{T}$$

with $\gamma(0) = \lambda$. Furthermore, the infinite sum in (9) can be truncated such that
the results are obtained for a given tolerance value.
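The recursion and the truncation of equation (9) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the TANGRAM-II implementation: the function names, the argument layout and the stopping rule (stop once the accumulated Poisson probability is within `tol` of 1) are hypothetical. `P` is taken to be the transition matrix of the uniformized chain, `Lambda` the uniformization rate, `lam` the reward vector and `pi` the stationary distribution.

```python
import math

def gamma_next(gamma, P):
    """One step of gamma(n) = gamma(n-1) P^T, i.e. gamma_i(n) = sum_j P[i][j] * gamma_j(n-1)."""
    n = len(gamma)
    return [sum(P[i][j] * gamma[j] for j in range(n)) for i in range(n)]

def rate_product_moment(lam, P, pi, Lambda, tau, tol=1e-10, max_terms=100000):
    """Approximate E[lambda(t) lambda(t+tau)] for a stationary chain, following (8) and (9)."""
    n = len(lam)
    gamma = list(lam)                    # gamma(0) = lambda
    weight = math.exp(-Lambda * tau)     # Poisson probability of 0 jumps during tau
    mass = 0.0                           # accumulated Poisson probability
    delta = [0.0] * n                    # truncated version of the vector in (9)
    for k in range(max_terms):
        for i in range(n):
            delta[i] += weight * gamma[i]
        mass += weight
        if 1.0 - mass <= tol:            # assumed stopping rule, not from the paper
            break
        weight *= Lambda * tau / (k + 1) # Poisson probability of k + 1 jumps
        gamma = gamma_next(gamma, P)
    return sum(lam[i] * pi[i] * delta[i] for i in range(n))

def autocovariance(lam, P, pi, Lambda, tau, tol=1e-10):
    """Autocovariance at lag tau: the product moment minus the squared mean rate."""
    mean_rate = sum(l * p for l, p in zip(lam, pi))
    return rate_product_moment(lam, P, pi, Lambda, tau, tol) - mean_rate * mean_rate
```

Because every $\gamma_i(n)$ is bounded by the largest reward, the neglected tail is at most `tol` times the largest reward, which is one simple way of meeting a given tolerance. For very large values of `Lambda * tau` the first Poisson weight underflows, so a production version would need a more careful computation of the Poisson terms.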



Index of Dispersion for Counts. We recall that, to obtain the IDC(t), we
need to calculate the first and the second moments of the number of data units
(e.g. packets) arriving over a time interval. Let $\lambda(t)$ be the arrival rate of data
units at time t. Note that $N(t) = \int_0^t \lambda(s)\,ds$ and therefore

$$E[N(t)] = \int_0^t E[\lambda(s)]\,ds. \qquad (10)$$

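Integrating $E[\lambda(s)]$ using the results from Markov reward models [24] gives:

$$E[N(t)] = t \sum_{n=0}^{\infty} e^{-\Lambda t} \frac{(\Lambda t)^{n}}{n!} \left[ \frac{\sum_{j=0}^{n} \lambda \cdot v(j)}{n+1} \right] \qquad (11)$$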
where $v(j)$ is the state probability vector of the uniformized chain after j steps
and the term in brackets, denoted here by $f(n)$, can be recursively calculated as follows:

$$f(n) = \frac{n}{n+1} f(n-1) + \frac{1}{n+1} \, \lambda \cdot v(n), \qquad f(0) = \lambda \cdot v(0). \qquad (12)$$
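A minimal Python sketch of equation (11) follows, using the running average of the rewards $\lambda \cdot v(j)$ for the term in brackets. The function name, the arguments and the stopping rule are assumptions made for illustration only; `P` is the transition matrix of the uniformized chain, `Lambda` the uniformization rate and `v0` the initial state probability vector.

```python
import math

def expected_count(lam, P, v0, Lambda, t, tol=1e-10, max_terms=100000):
    """Approximate E[N(t)] from equation (11) by truncating the Poisson sum."""
    n = len(lam)
    v = list(v0)                                   # v(0)
    f = sum(lam[i] * v[i] for i in range(n))       # bracketed term for n = 0
    weight = math.exp(-Lambda * t)                 # Poisson probability of 0 jumps in (0, t)
    mass = 0.0
    total = 0.0
    for k in range(max_terms):
        total += weight * f
        mass += weight
        if 1.0 - mass <= tol:                      # assumed stopping rule
            break
        weight *= Lambda * t / (k + 1)             # Poisson probability of k + 1 jumps
        # v(k+1) = v(k) P
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        reward = sum(lam[i] * v[i] for i in range(n))
        # running average of lambda . v(j) over j = 0..k+1
        f = ((k + 1) * f + reward) / (k + 2)
    return t * total
```

When `v0` is the stationary distribution, every $\lambda \cdot v(j)$ equals the mean rate and the function simply returns the mean rate multiplied by `t`.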



We use the result obtained by Grassmann in [25] to calculate the second
moment of $N(t)$. For conciseness we omit the steps to obtain $E[N(T)^2]$. We
note that the development to obtain $E[N(T)^2]$ is slightly different from that of
[25], since we use our result above to calculate $E[\lambda(s)\lambda(t)]$.

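By definition,

$$N(T)^2 = \int_0^T \!\! \int_0^T \lambda(s)\lambda(t)\,ds\,dt = 2 \iint_{0<s<t<T} \lambda(s)\lambda(t)\,ds\,dt, \qquad (13)$$

and so,

$$E[N(T)^2] = 2 \iint_{0<s<t<T} E[\lambda(s)\lambda(t)]\,ds\,dt. \qquad (14)$$

Using our previous results, we have:

$$E[N(T)^2] = \frac{2}{\Lambda^2} \sum_{j=0}^{\infty} E_{j+2,\Lambda}(T) \sum_{k=1}^{N} \lambda_k D_k^{(j)}, \qquad (15)$$

where

$$D_k^{(j)} = \sum_{i=0}^{j} \sum_{l=1}^{N} \lambda_l \, p_{lk}(j-i) \, v_l(i), \qquad (16)$$

$v_k(j)$ is the probability that the model is in state k at step j and $E_{j+2,\Lambda}(T)$ is
the Erlang distribution with (j + 2) stages.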

We note that $D_k^{(j)}$ can be recursively calculated as in [25]:

$$D_k^{(j)} = \sum_{s=1}^{N} D_s^{(j-1)} p_{sk} + \lambda_k v_k(j), \qquad (17)$$

with $D_k^{(0)} = \lambda_k v_k(0)$.
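Equations (15) to (17) translate into a short loop. The following Python sketch is an illustration under assumptions, not the tool's code: the function name and the stopping rule (stop when the Erlang term drops below `tol`) are hypothetical, and the Erlang distribution function with j + 2 stages is evaluated as one minus the Poisson probabilities of 0 to j + 1 events.

```python
import math

def count_second_moment(lam, P, v0, Lambda, T, tol=1e-12, max_terms=100000):
    """Approximate E[N(T)^2] using the recursion for D_k(j) and a truncated sum over j."""
    n = len(lam)
    v = list(v0)                                     # v(0)
    D = [lam[k] * v[k] for k in range(n)]            # D_k(0) = lambda_k v_k(0)
    pois = math.exp(-Lambda * T)                     # Poisson probability of 0 events
    cum = pois
    pois *= Lambda * T                               # Poisson probability of 1 event
    cum += pois                                      # now covers 0 and 1 events
    total = 0.0
    j = 0
    while True:
        erlang = max(0.0, 1.0 - cum)                 # Erlang CDF with j + 2 stages at T
        total += erlang * sum(lam[k] * D[k] for k in range(n))
        if erlang <= tol or j >= max_terms:          # assumed stopping rule
            break
        j += 1
        pois *= Lambda * T / (j + 1)                 # Poisson probability of j + 1 events
        cum += pois
        v = [sum(v[i] * P[i][k] for i in range(n)) for k in range(n)]   # v(j)
        # D_k(j) = sum_s D_s(j-1) p_sk + lambda_k v_k(j)
        D = [sum(D[s] * P[s][k] for s in range(n)) + lam[k] * v[k] for k in range(n)]
    return 2.0 * total / (Lambda * Lambda)
```

With the first moment available, the index of dispersion follows as the variance of N(T), that is $E[N(T)^2] - E[N(T)]^2$, divided by $E[N(T)]$.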

4 Modules for Traffic Modeling, Analysis 
and Experimentation 

As briefly mentioned in the introduction there are several steps that one may 
perform to study the influence of multimedia sources on system resources. Figure
1 outlines some of these steps.

In this section we describe the tools that we have available in our environment 
to aid in the traffic modeling and analysis process. The traffic modeling tools 
are being integrated as part of TANGRAM-II [17], providing a rich environment
for modeling and analysis of computer communication systems. 
Fig. 1. Basic modeling steps



Following Figure 1, one should first choose the traffic source(s) to be investigated
and obtain the proper traffic statistics. These statistics can be collected by
performing traffic measurements on the aggregate traffic of an existing network,
or from pre-recorded individual sources of multimedia streams, such as video
sequences coded in MPEG.

The tool supports the collection of statistics in two ways. One module is used to
collect real-time statistics of a packet stream received from a source. Another
module calculates statistics from a pre-recorded stream. This module accepts as
input the number of bytes per given interval and produces first and second order
statistics as output. The first order statistics are: average traffic rate (bits/time
unit), variance and burstiness. The second order statistics are: autocovariance,
autocorrelation and index of dispersion for counts. We can also obtain other
measures such as the fraction of time above a given rate.
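The descriptors listed above can be estimated from a sequence of byte counts per interval along the following lines. This is a minimal sketch: the function names are hypothetical, the estimators are the simplest sample versions (population variance, autocovariance averaged over the available pairs, IDC over non-overlapping windows), and the module in the tool may use different estimators.

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance of the sequence."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def autocovariance(xs, lag):
    """Sample autocovariance at the given lag, averaged over the available pairs."""
    m = mean(xs)
    n = len(xs) - lag
    return sum((xs[t] - m) * (xs[t + lag] - m) for t in range(n)) / n

def autocorrelation(xs, lag):
    return autocovariance(xs, lag) / variance(xs)

def idc(xs, window):
    """Index of dispersion for counts over non-overlapping windows of `window` intervals."""
    blocks = [sum(xs[i:i + window]) for i in range(0, len(xs) - window + 1, window)]
    return variance(blocks) / mean(blocks)

def fraction_above(xs, rate):
    """Fraction of intervals whose count is strictly above the given rate."""
    return sum(1 for x in xs if x > rate) / len(xs)
```

For example, `idc(trace, 10)` evaluates the index of dispersion at a time scale of ten measurement intervals, where `trace` is the list of byte counts.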

From the measured statistics we can construct a traffic model that can be
used in a simulation or in an analytical overall model of the system resources. If
the traffic model is Markovian, first and second order statistics can be obtained
using the recursions given in section 3. The overall model containing the traffic
model plus the network resources model can be built using the TANGRAM-II
tool. TANGRAM-II also calculates several measures of interest, such as loss
probabilities.

By comparing the values of the traffic descriptors (both from data traces
and the model) and the values of the calculated measures, we can then refine
the traffic model as indicated by the dashed lines in Figure 1. It is still an open
issue to choose the descriptors to be matched in order to obtain a model that
accurately predicts the performance measures of interest.

The objective of the tool is not to automatically create an accurate traffic
model, but to assist the user in constructing the model. The user has the option
to experiment with different models and study the sensitivity of the measures
of interest to different descriptors.

In summary, one module calculates the following traffic descriptors from a
given sequence: mean rate, variance, autocovariance, autocorrelation, IDC(t),
fraction of time above a given rate. From the IDC(t) curve, the Hurst parameter
can also be estimated. So far we only use the so-called “naive estimator” for the
Hurst parameter, which takes into account the slope of the IDC(t) (see the earlier
discussion of the IDC(t)).
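One common way to turn the slope of the IDC(t) curve into a Hurst estimate is sketched below. It assumes the usual asymptotic relation for self-similar count processes, IDC(t) proportional to $t^{2H-1}$, so that H is obtained from the slope of log IDC(t) against log t. The function name and the use of an ordinary least-squares fit are assumptions; the exact procedure used by the tool is not given in the text.

```python
import math

def hurst_from_idc(times, idc_values):
    """Estimate H from the least-squares slope of log IDC(t) versus log t.

    Assumes IDC(t) ~ c * t**(2H - 1), hence H = (slope + 1) / 2.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(v) for v in idc_values]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return (slope + 1.0) / 2.0
```

In practice only the range of time scales where the log-log plot is roughly linear should be passed to the function, since short time scales are dominated by short term correlations.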

The user is not limited to Markovian traffic models. The TANGRAM-II
simulator allows the specification of inter-event times obtained from samples
from FARIMA or FBM processes. In this case the user must specify the mean
rate, variance, time scale, and Hurst parameter. From either the Markovian
models or the FARIMA and FBM, second order statistics can be obtained,
directly by recursions (if the model is Markovian) or from a trace generated by the
simulator.

We have also implemented a traffic generator for laboratory experimentation.
This module generates traffic specified by the user from the local computer to a
given destination (unicast transmission) or a given set of hosts (multicast group).
The traffic is specified in terms of frames, i.e. the user gives the frame size (in
bytes) and the interval between frames. Figure 2 shows this module’s interface.



[Screenshot: the Traffic Generation window, with Help, Run and Cancel controls]



Fig. 2. Traffic Generator module 



The user has the option to specify: the size of the packets to be transmitted,
the total traffic generation time, the interval between frames and the number of
bytes per frame. Packets from a frame are transmitted in one of two ways:
either at a rate equal to the nominal rate of the network board, or
uniformly spread over the interval between frames.
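The two transmission modes can be illustrated by computing when each packet of a frame would be sent. The sketch below is hypothetical: the function name, its parameters and the exact spacing rule for the uniform mode (packet k sent at k times the interval divided by the number of packets) are assumptions made for illustration, not a description of the generator's code.

```python
import math

def packet_offsets(frame_bytes, packet_bytes, frame_interval, link_rate=None):
    """Return a list of (send offset in seconds, packet size in bytes) for one frame.

    link_rate is in bytes per second. If it is given, packets are sent back to back
    at that rate; if it is None, packets are spread uniformly over frame_interval.
    """
    if frame_bytes <= 0:
        return []
    n_packets = math.ceil(frame_bytes / packet_bytes)
    # All packets are full size except possibly the last one.
    sizes = [packet_bytes] * (n_packets - 1)
    sizes.append(frame_bytes - packet_bytes * (n_packets - 1))
    if link_rate is None:
        gap = frame_interval / n_packets
        return [(k * gap, sizes[k]) for k in range(n_packets)]
    schedule = []
    t = 0.0
    for size in sizes:
        schedule.append((t, size))
        t += size / link_rate      # transmission time of this packet
    return schedule
```

Spreading the packets smooths the traffic inside each frame interval, whereas sending at the nominal rate produces a burst at the start of every frame.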

There are three “types” of traffic that can be generated: constant bit rate
(CBR) traffic, Markov modulated traffic and traffic from a trace. Details of each
specification are read from a given file. For instance, for the Markov modulated
traffic, the user specifies the generator matrix and a reward vector, where entry i
gives the frame rate at which traffic is generated when the model is at state i.

As mentioned above, the generator can be connected to a receiver module. 
The module calculates the average packet rate and variance, the jitter distribu- 
tion, the number of lost packets between two packets received, etc. 

The following section presents examples that show some of the curves that 
can be obtained from the set of modules implemented. 

5 Examples 

In this section we present two examples to illustrate the environment we
developed for traffic modeling and analysis. In the two examples we calculate traffic
descriptors for a real sequence using the module collect statistics (see Figure 1)
and obtain several parameters used for building different traffic models.
Following the steps outlined in Figure 1, we compute traffic descriptors from the models
using the module calculate statistics and compare those with the values obtained
from the real sequence. Finally, we construct a simple performance model where
a single server queue represents a network channel and is fed by a traffic model
obtained in the previous step. From this last model we evaluate the queue cell
loss ratio (CLR).

The real sequence considered in the first example represents aggregated
data traffic taken from Bellcore (pAug.TL) (the sequence was retrieved from
flash.bellcore.com). The second example also represents aggregated traffic,
but it is obtained by summing five video sequences: two mtv.I, two soccerWM.I
and one term_.I (Terminator movie) coded in MPEG-1 (these sequences were
retrieved from ftp-info3.informatik.uni-wuerzburg.de). Figure 3 shows the cell
rate of this last sequence for three time scales: one frame, ten frames and one
hundred frames (the time unit is 1/24 sec per frame). We can observe from the
plots that the aggregated video traffic exhibits self-similarity.




Fig. 3. Cell rate for different time scales 



5.1 Traffic Models 

We consider three traffic models in our experiments: two Markovian models and
a FARIMA model. The first Markovian model is an MMPP (Markov Modulated
Poisson Process) based on [14], and the second one is a pseudo self-similar
Markovian model which was proposed by Robert and Le Boudec [11]. We
emphasize that our goal is not to evaluate the accuracy of the models we consider
in our experiments. Studies concerning the accuracy of pseudo self-similar and
MMPP models can be found, for instance, in [26] and [T^ respectively.

The parameters of the MMPP model are the transmission rates associated
with each state of the Markov chain, and the state transition probabilities. We
choose to model the traffic using 8 states as suggested by [14]. The transmission
rates are obtained by subdividing the real sequence into eight rate levels, each
corresponding to a state of the Markov chain. The transition probabilities
between states are obtained from the relative frequency of the transitions between
rate levels as measured from the real sequence.
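The parameter estimation just described can be sketched as follows. The sketch makes two assumptions that the text does not state explicitly: the rate levels are equal-width bins between the minimum and maximum of the sequence, and each state's rate is the midpoint of its bin. The function name `fit_rate_levels` is hypothetical.

```python
def fit_rate_levels(trace, levels=8):
    """Estimate per-state rates and a transition probability matrix from a rate trace."""
    lo, hi = min(trace), max(trace)
    if hi == lo:
        raise ValueError("trace is constant; rate levels cannot be formed")
    width = (hi - lo) / levels

    def level(x):
        # The maximum value falls into the last level.
        return min(int((x - lo) / width), levels - 1)

    states = [level(x) for x in trace]
    rates = [lo + (k + 0.5) * width for k in range(levels)]   # bin midpoints (assumed)

    # Count transitions between consecutive samples.
    counts = [[0] * levels for _ in range(levels)]
    for s, t in zip(states, states[1:]):
        counts[s][t] += 1

    # Relative frequencies; rows of levels that are never left stay all zero.
    P = []
    for row in counts:
        total = sum(row)
        P.append([c / total for c in row] if total else [0.0] * levels)
    return rates, P
```

Levels that never occur in the trace, or that occur only as its last sample, give all-zero rows and would have to be removed or merged before the matrix is used as a Markov chain.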

The pseudo self-similar model proposed in [11] is a discrete time Markov
modulated chain with three parameters: a, b, and n, where a > 1, b < a, and n
is the number of states. In our examples we use a similar continuous time model.
In this model the cell generation is associated only with the first state and is equal
to one cell per time unit. We consider n = 5 as suggested in [11]. The parameter
b is obtained by matching the mean arrival rate of the real sequence with that
calculated from the model. The method to calculate the parameter a described
in [11] is based on the matching of the Hurst parameter. We have obtained the
value of a in a similar way. The method we used to estimate the Hurst parameter
is the “naive estimator” mentioned in section 4, and it is based on the plot of
the IDC(t).

In order to evaluate the queue cell loss ratio, we have constructed four
models using the TANGRAM-II tool. Two of them consist of a finite exponential
queue with a Markovian traffic source (MMPP or pseudo self-similar) as input.
The steady state probabilities for the models were obtained using the GTH [27]
method, which is one of the steady state solvers implemented in the TANGRAM-II
tool. The CLR is easily computed from the steady state solution.
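For readers unfamiliar with the GTH method, the following Python sketch shows the elimination scheme in its textbook form for a small dense chain. It is not the TANGRAM-II solver; the function name is hypothetical and the input is assumed to be the transition matrix of an irreducible chain (the off-diagonal rates of a generator matrix can be used in the same way).

```python
def gth_steady_state(P):
    """Steady state distribution of an irreducible Markov chain by GTH elimination."""
    n = len(P)
    A = [list(row) for row in P]          # work on a copy
    # Eliminate states n-1, n-2, ..., 1 using only additions, multiplications
    # and divisions of non-negative numbers (no subtractions).
    for k in range(n - 1, 0, -1):
        s = sum(A[k][j] for j in range(k))
        for i in range(k):
            A[i][k] /= s
        for i in range(k):
            for j in range(k):
                A[i][j] += A[i][k] * A[k][j]
    # Back substitution, then normalization.
    pi = [0.0] * n
    pi[0] = 1.0
    for k in range(1, n):
        pi[k] = sum(pi[i] * A[i][k] for i in range(k))
    total = sum(pi)
    return [x / total for x in pi]
```

The absence of subtractions is what gives the method its good numerical behaviour. A solver for large models would exploit sparsity rather than use dense lists as done here.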

The other two models contain a finite deterministic queue. In one model the
queue is fed by the real sequence trace and in the other model the queue is
fed by the output of a FARIMA source. Both models were solved using the
TANGRAM-II simulator.

Figure 4 shows one of the models we have developed. It has a pseudo
self-similar Markovian traffic source and an exponential service time queue with
a finite buffer of capacity 200 cells.



5.2 Results 

In what follows, we present a few results we can obtain with our modeling envi- 
ronment using the examples above. 

In the first example we modeled the Bellcore traffic using the MMPP and the
pseudo self-similar model. The parameters for the pseudo self-similar model are
a = 6.7, b = 0.5764, and n = 5. The rate vector (in cells per second) associated
with each state of the MMPP model is equal to: $\lambda$ = (30243, 26212, 22181, 18150,
14118, 10087, 6056, 2025) and the matrix P is obtained as described in section
5.1.

[Figure: TANGRAM-II model definition. Object Pseudo_self_similar: a = 6.7,
b = 0.5764, dest = Server_Exp_Queue, pkt_out = link. Object Server_Exp_Queue:
queue = 0, service_rate = 0.05, queue_size = 200, port = link.]

Fig. 4. Queue model

Figures 5(a) and 6(a) show the values obtained for the autocovariance and
IDC for the models and the Bellcore sequence. From the curves it is evident that
the model which best matches the descriptors for the time scales plotted is the
pseudo self-similar model.

Figure 7(a) shows the CLR as a function of the queue’s load, for a buffer
length equal to 200 cells. From the curves obtained for the Bellcore sequence
we can see that the model which better estimates the CLR is the pseudo
self-similar model. The MMPP model is not accurate in the estimation of the CLR for this
example. This is consistent with our expectations since, from the plots of Figures
5(a) and 6(a), the MMPP model with 8 states poorly matches the descriptors.

A remark is needed at this point. The calculation of the descriptors for the
Markovian models assumes a reward model where, at state i, the source generates
traffic at constant rate $\lambda_i$. The calculations for the CLR assume an MMPP model,
where a cell is generated at exponential time units with mean $1/\lambda_i$ for state i.
However, the difference between the two sets of models was negligible for the models
we consider, since the residence time in a state is greater than the transmission
rate associated with it.

Fig. 5. Autocovariance

Fig. 6. Index of dispersion

In the second example we have modeled the aggregated video traffic using the
MMPP and the FARIMA model. The rate vector (in cells per frame) associated
with each state of the MMPP model is equal to: $\lambda$ = (1478, 1333, 1188, 1042, 897,
752, 607, 462) and the matrix P is obtained as described in section 5.1.

We have considered the following parameters for the FARIMA model: mean
rate = 830.58, variance = 34.068, H = 0.795, and time scale = 1. In this example a
buffer of 1000 cells was used.

In Figures 5(b) and 6(b), we show the values obtained for the
autocovariance and IDC for the models and the aggregated video sequence. From the
curves we can observe that the MMPP model matches the short term
correlations, until a lag of 2.5 seconds (for the autocovariance) and until a lag of 5
seconds (for the IDC). On the other hand, the FARIMA model matches the long
term correlations (for lags above 10 frames).

Figure 7(b) shows the CLR as a function of the queue’s load. We can observe
that the model which better estimates the CLR is the MMPP. The FARIMA
model underestimates the CLR for loads below 1.1. It is interesting to note that,
in this example, the model which better estimates the CLR is the one which
matches the short term correlations. Previous studies [28, 29, 13] have shown that
the number of frame correlations that affect the cell loss is relatively small
and finite for realistic ranges of buffer size.
Fig. 7. Cell loss ratio x load



6 Summary 

Traffic models are useful for predicting the performance measures of network
resources. They should be capable of capturing the correlation structure of the traffic
they represent. We have developed a few modules to aid the user in the traffic
modeling process. These modules allow the calculation of traffic descriptors
from traces as well as from Markovian models. For Markovian models we have
obtained recursive expressions for calculating second order statistics. Using the
TANGRAM-II simulator, the user can specify FARIMA and FBM traffic sources.

A traffic generator is also included in the set of tools. The generator is useful
for performing experiments in a local controlled environment. The tool can generate
a variety of traffic, such as CBR, traffic from Markovian models, FARIMA, FBM and
traffic from general traces obtained from real data.

The set of tools is currently being integrated in TANGRAM-II to provide a
rich environment for traffic engineering. This includes analytical and simulation
solvers.



Acknowledgments 

We would like to thank Daniel R. Figueiredo and Carlos F. de Brito for their
participation in the development of the TANGRAM-II tools. Our thanks also go
to Adenilson Raniery who implemented the FARIMA and FBM routines that
are currently part of the TANGRAM-II simulator, and Magnos Martinello who
participated in the implementation of the traffic generator module.



References 

1. A. Adas. Traffic Models in Broadband Networks. IEEE Communications Magazine, 
(7):82-89, 1997. 

2. H. Michiel and K. Laevens. Traffic Engineering in a Broadband Era. Proceedings 
of the IEEE, pages 2007-2033, 1997. 




54 R.M.M. Leao, E. de Souza e Silva, and S.C. de Lucena 

3. V.S. Frost and B. Melamed. Traffic Modeling for Telecommunications Networks. 
IEEE Communications Magazine, 32(3):70-81, 1994. 

4. A. I. Elwalid and D. Mitra. Fluid Models for the Analysis and Design of Statistical 
Multiplexing with Loss Priorities on Multiple Classes of Bursty Traffic. In Infocom- 
92, pages 415-425, 1992. 

5. Basil Maglaris, Dimitris Anastassiou, Prodip Sen, Gunnar Karlsson, and John D. 
Robbins. Performance Models of Statistical Multiplexing in Packet Video Com- 
munications. IEEE Transactions on Communications, 36(7):834-844, 1988. 

6. David L. Jagerman and Benjamin Melamed. The Transition and Autocorrelation
Structure of TES Processes, Part I: General Theory. Communications in Statistics
- Stochastic Models, 8(2):193-219, 1992.

7. David L. Jagerman and Benjamin Melamed. The Transition and Autocorrelation
Structure of TES Processes, Part II: Special Cases. Communications in Statistics
- Stochastic Models, 8(3):499-527, 1992.

8. W. Leland, W. Willinger, M. Taqqu, and D. Wilson. On the Self-Similar Nature
of Ethernet Traffic (Extended Version). IEEE/ACM Transactions on Networking,
2(1):1-15, February 1994.

9. Oliver Rose. Simple and efficient models for variable bit rate MPEG video traffic.
Performance Evaluation, 30(1-2):69-85, 1997.

10. Oliver Rose. Statistical Properties of MPEG Video Traffic and Their Impact on 
Traffic Modeling in ATM Systems. In Proceedings of the 20th Annual Conference 
on Local Computer Networks, pages 397-406, 1995. 

11. Stephan Robert and Jean-Yves Le Boudec. On a Markov Modulated Chain
Exhibiting Self-similarities over Finite Timescale. Performance Evaluation,
27:159-173, 1996.

12. A.T. Andersen and B.F. Nielsen. A Markovian Approach for Modeling Packet 
Traffic with Long-Range Dependence. IEEE JSAC, 16(5):719-732, June 1998. 

13. D.P. Heyman and T.V. Lakshman. What are the Implications of Long-Range
Dependence for VBR-Video Traffic Engineering. IEEE/ACM Transactions on
Networking, 4(3):301-317, June 1996.

14. Paul Skelly, Mischa Schwartz, and Sudhir Dixit. A Histogram-Based Model for
Video Traffic Behavior in an ATM Multiplexer. IEEE/ACM Transactions on
Networking, 1(4):445-459, 1993.

15. San qi Li, SangKyu Park, and Dogu Arifler. SMAQ: A Measurement-Based Tool for 
Traffic Modeling and Queuing Analysis Part I: Design Methodologies and Software 
Architecture. IEEE Communications Magazine, 36:56-65, August 1998. 

16. H. Che and S.Q. Li. Fast Algorithms for Measurement-Based Traffic Modeling. 
IEEE JSAC, 16(5):612-625, June 1998. 

17. R.M.L.R. Carmo, L.R. de Carvalho, E. de Souza e Silva, M.C. Diniz, and R.R.
Muntz. Performance/Availability Modeling with the TANGRAM-II Modeling
Environment. Performance Evaluation, 33:45-65, 1998.

18. Ricardo Gusella. Characterizing the Variability of Arrival Processes with Indexes 
of Dispersion. IEEE Journal on Select Areas in Communication, 9(2):203-211, 
1991. 

19. Fabrice Guillemin and Alain Dupuis. Indices of Dispersion and ATM Traffic Char- 
acterization. Technical report. Centre National d’Etudes des Telecommunications, 
Lannion-A, 22300 Lannion, France, 3 1994. 

20. A.R. Reibman and A.W. Berger. Traffic Descriptors for VBR Video
Teleconferencing over ATM Networks. IEEE/ACM Transactions on Networking, 3(3):329-339,
1995.




A Set of Tools for Traffic Modeling, Analysis and Experimentation 



55 



21. Kotikalapudi Sriram and Ward Whitt. Characterizing Superposition Arrival Pro- 
cesses in Packet Multiplexers for Voice and Data. IEEE Journal on Select Areas 
in Communication, 4(6):833-846, 1986. 

22. B.L. Mark, D.L. Jagerman, and G. Ramamurthy. Peakedness Measures for Traffic
Characterization in High-Speed Networks. In INFOCOM’97, 1997.

23. A. Jensen. Markov chains as an aid in the study of Markov processes. Skand. 
Aktuarietidskr, 36:87-91, 1953. 

24. Edmundo de Souza e Silva and H. Richard Gail. The Uniformization Method 
in Performability Analysis. Technical report, IBM Research Division, Thomas J. 
Watson Research Center - Yorktown Heights, NY 10598, U.S.A., 2 1996. 

25. Winfried Grassmann. Means and variances of time averages in Markovian
environments. European Journal of Operational Research, (31):132-139, 1987.

26. A. Ost and B.R. Haverkort. Modeling and Evaluation of Pseudo Self-Similar Traffic 
with Infinite-State Stochastic Petri Nets. In M. Ajmone Marsan et al., editor. 
Formal Methods and Telecommunications. Zaragosa University Press, 1999. 

27. W.K. Grassmann, M.I. Taksar, and D.P. Heyman. Regenerative Analysis and Steady
State Distributions for Markov Chains. Operations Research, 33(5):1107-1116, 1985.

28. Bo Ryu and Anwar Elwalid. The Importance of Long-Range Dependence of VBR 
Video Traffic in ATM Traffic Engineering: Myths and Realities. In ACM SIG- 
COMM, 1996. 

29. A. Elwalid, D. Heyman, T.V. Lakshman, D. Mitra, and A. Weiss. Fundamental 
Bounds and Approximations for ATM Multiplexers with Applications to Video 
Teleconferencing. IEEE JSAC, (13):1004-1016, 1995. 




Queueing Analysis of Pools in Soft Real-Time Systems 



Carlos Juiz and Ramon Puigjaner 

Universitat de les Illes Balears
Departament de Ciències Matemàtiques i Informàtica
Carretera de Valldemossa, km 7.6 
07071 PALMA (Spain) 

Phone: +34-971-172975 Fax: +34-971-173003 

{dmicjg4, dmirpt0}@uib.es



Abstract. Software packages for designing large real-time systems do not 
typically provide any performance tools that will enable the designer to analyse 
the performance of the system that is being designed. In this paper, we present a 
queueing model of the pool, a software component that is commonly used in 
every large soft real-time system. This basic component transfers data among 
tasks without synchronisation in a non-selective manner. This performance 
model can be used in a software package to complement the automatic design 
and generation of soft real-time systems. The queueing model is analysed 
approximately using a decomposition technique. The analytical approximation 
is based on a new variant of the semaphore queue paradigm. Numerical tests 
show that the approximation has a good accuracy by comparison with the 
results obtained from a simulation model of the pool. 



1 Introduction 

Real-time systems differ from traditional software systems in that they have a dual 
notion of logical correctness. Logical correctness of a real-time system is based on 
both the correctness of the output and timeliness. That is, in addition to producing the 
correct output, real-time systems must produce it in a timely manner. A hard real- 
time system is a system that satisfies explicit or bounded response-time constraints 
[10]. On the other hand, for a soft real-time system it is less critical to produce an
output within a deadline. Significant tolerance can be permitted and the software
tends to be large and complex [19].

A software design defines how a software system is structured into components 
and also defines the interfaces between them. The nature of each component depends 
on the concepts and strategies employed by a method. A software design method is a 
systematic approach for creating a system design. During a given design step, the 
method may provide a set of structuring criteria to help the designer in decomposing 
the system into its components [7]. For the design of soft real-time systems, a major 
contribution came with the introduction of the Mascot notation, which provided a 
systematic design method [1]. Based on a data flow approach, the Mascot notation 
formalised the way tasks communicate with each other, either via channels, for 
message communication, or pools, which are information hiding modules that 
encapsulate shared data structures. Software methods for designing real-time systems
do not usually provide any performance tools that permit the evaluation of the system 
under study [11]. 

It is usually assumed that the main value of queueing models is as high-level
models of computer system performance, giving an overall view of whether a system
can meet its performance goals, and that other modelling techniques are needed for
more detailed performance analysis. Contrary to this last assumption, we propose a
queueing-based model for analysing a software component, known as a pool in the
Mascot design notation. This queueing model can be incorporated into a library of
performance models that will complement a software tool for designing large soft
real-time systems.

A common operation in multitasking systems is to transfer data between tasks 
without any need for task synchronisation. Such data movements can be 
unidirectional or bi-directional, selective or non-selective. These communication
requirements have led to the use of two data transfer methods: readers-writers for
non-selective data transfer and producer-consumer for selective data transfer [4].

A producer-consumer entity provides a storage mechanism for passing data from
the producer to the consumer. The producer puts information in the buffer if there is
space available and the consumer gets information from the buffer if it is not empty.
This software entity is known as a channel in the soft real-time design context. Simple,
random, composite, priority and other kinds of channels were studied in [12], [14] and
[15].

The readers-writers structure is a data store that can be written to or read from
by any task at any time. This is a means of sharing data among a number of tasks.
Senders write information into the store and receivers read the entire stored
information. Thus, writers and readers need to have no knowledge of each other.
Although reading has no effect on the stored data, writing is a destructive operation.
Moreover, the usage of the shared data is random. This software component is known
as a pool in Mascot terminology.
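The difference between the two data transfer methods can be made concrete with a small Python sketch. The class and method names are hypothetical and the sketch is deliberately simplified: it ignores concurrency control and does not block the caller, whereas a real channel would make the producer or the consumer wait.

```python
from collections import deque

class Channel:
    """Producer-consumer: items are queued and each item is consumed exactly once."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def put(self, item):
        """Store an item if there is space; return False if the buffer is full."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        return True

    def get(self):
        """Remove and return the oldest item, or None if the buffer is empty."""
        return self.items.popleft() if self.items else None

class Pool:
    """Readers-writers: writing overwrites the stored data, reading leaves it intact."""

    def __init__(self, initial=None):
        self.value = initial

    def write(self, value):
        self.value = value        # destructive: the previous content is lost

    def read(self):
        return self.value         # non-destructive: may be repeated by any reader
```

A value written to the pool can be read any number of times until it is overwritten, while an item placed in the channel disappears once it has been consumed.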

In the following section, we review the basic algorithm for analysing semaphore- 
based queueing systems. Section 3 gives a queueing model of the pool by the 
application of the classical single-server semaphore queue and a new multi-server 
semaphore queue. In Section 4, some numerical examples are provided, and finally 
the conclusions are given in Section 5. 

2 Semaphore Queues 



The channel and pool modelling is based on the semaphore queue paradigm. A basic
channel model is analysed using the algorithm in [6], developed to study the window
mechanism for traffic control in a communication network. In this section, we
describe the basic channel model. This model is used and modified in the next section to
build a basic pool. The basic channel structure can be modelled by two queues in
tandem, as shown in figure 1. The first queue is an infinite-capacity queue and its
server represents the producer. The second queue has a finite capacity of C customers
and its server represents the consumer. The finite buffer represents the storage in
which the producer places the produced items, and from which the consumer
consumes the stored items. When the finite queue becomes full, the first server gets
blocked. That is, it cannot start another service until there is space available in the
finite buffer. When a departure occurs from the consumer’s queue, a space becomes
available, and the producer becomes unblocked. At that moment, it can start
servicing a request, if there is one waiting in its queue. This type of queueing network
is known as a queueing network with blocking before service [17]. In the literature,
this blocking has been referred to as type 2 blocking, communication blocking,
immediate blocking or service blocking. Such queueing systems typically do not
have a closed-form solution. Specifically, the queueing network model shown in
figure 1 has been analysed numerically as a Markov process, or using generating
functions. It can also be analysed by simulation. Simulation is considerably more
execution-time intensive than any of the other approaches mentioned. However, it is
the most flexible modelling technique, since one can simulate a queueing network
under more realistic assumptions. Alternatively, the network can be analysed
approximately by a decomposition-aggregation technique.




Fig. 1. A tandem two-node queueing network with blocking before service. While the Producer 
queue has infinite capacity, the Consumer has a finite capacity of C customers 




Fig. 2. The semaphore-based queueing model of the producer-consumer, i.e. a basic channel 

It can be shown that the queueing network of figure 1 can be transformed into the
equivalent semaphore-based queueing system shown in figure 2. This queueing
network consists of the producer and the consumer queues, which will be referred to
as the inner system, and a semaphore station.

The semaphore station S consists of an input queue f(S), referred to as the customer
queue, and a token queue e(S), referred to as the resource queue. Customers arrive at
the system in a Poisson fashion at the rate of λ. An arriving customer joins the input
queue if it finds other customers waiting in the queue. If it finds the queue empty,
then it requests a token from the token queue. If there is a token available in queue
e(S), the customer takes a token and then it moves into the inner system. However, if
the token queue is empty, the customer waits in the input queue f(S) until a token
becomes available. A customer that enters the inner system joins the producer queue.
After receiving service at the producer, the customer departs from the queueing
system and its token joins the consumer’s queue. After service completion the token
joins the token queue.
Since a customer cannot enter the inner system without a token, the total number
of customers in the inner system may not exceed C, the maximum number of tokens. 
If there are tokens in e(S), then that means that there are no customers in the input 
queue. On the other hand, if there are customers in the input queue, then e(S) is 
empty. 
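The operation of the semaphore station, and the fact that the input queue and the token queue are never both non-empty, can be illustrated with a small state-keeping sketch in Python. The class and method names are hypothetical; the sketch tracks only the two queue lengths and leaves out the timing of arrivals and services.

```python
class SemaphoreStation:
    """Tracks the customers waiting in f(S) and the tokens available in e(S)."""

    def __init__(self, tokens):
        self.waiting = 0         # customers in the input queue f(S)
        self.tokens = tokens     # tokens in the token queue e(S), at most C

    def customer_arrives(self):
        """Return True if the customer takes a token and enters the inner system at once."""
        if self.waiting == 0 and self.tokens > 0:
            self.tokens -= 1
            return True
        self.waiting += 1
        return False

    def token_returns(self):
        """Return True if the returning token is taken at once by a waiting customer."""
        if self.waiting > 0:
            self.waiting -= 1    # the head-of-line customer enters the inner system
            return True
        self.tokens += 1
        return False
```

After any sequence of calls, at least one of `waiting` and `tokens` is zero, which is the property stated above.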




Fig. 3. The input queue and the token queue of a semaphore 

In figure 2, symbols commonly used in Petri nets are introduced in order to depict
the fork and join operations [18]. In particular, the join symbol of figure 3 depicts the
following operation. At the instant that queues f(S) and e(S) contain a customer, both
instantaneously depart from their respective queues and merge into a single customer.
The fork symbol (figure 4) depicts the following operation. A customer arriving at
this point is split into two siblings. These two symbols are used for descriptive
convenience.



Fig. 4. The fork operation: a customer arriving at the transition is split into two siblings 

An exact analysis of the queueing system depicted in figure 2 is rather difficult. 
Therefore, it is proposed to analyse it using decomposition and aggregation [5]. The 
basic modelling strategy consists of decomposing the queueing network into a flow- 
equivalent subnetwork and the complement subnetwork. Then the complement 
subnetwork is short-circuited to form a closed network. This way, the flow-equivalent 
subnetwork is analysed in isolation to obtain the throughput across the short-circuit 
for various populations. Finally, the complement subnetwork is replaced by a load 
dependent server in the original network. The server capacity is set equal to the 
throughput of the complement subnetwork (aggregation step). 

In particular, the system shown in figure 3 is first analysed assuming that the 
arrival process at queue e(S) is described by a state-dependent arrival rate γ(k). This 
queueing system depicts the semaphore operation described above. The arrival 
process at queue f(S) is assumed to be Poisson and there are C tokens. It is 
also assumed that the interarrival times of tokens at queue e(S) are exponentially 
distributed with rate γ(k), where k is the number of outstanding tokens, i.e. C - k is 
the number of tokens in queue e(S). 

The state of the system in equilibrium can be described by the tuple (i,j), where i is 
the number of customers in queue f(S) and j is the number of tokens in queue e(S). 
The rate diagram associated with this system is shown in figure 5. 
Fig. 5. Flow rate diagram of a semaphore queue with arrival rate λ and state-dependent 
service rate γ(k) 

Note that this system is a birth-and-death process with arrival rate λ 
and a state-dependent service rate equal to γ(k) if k < C and γ(C) if k ≥ C, where k is the 
number of customers in this queue. The random variables i and j are related to k as 
follows: the number of customers waiting in the input queue is i = max(0, k - C), and the 
number of tokens left in the token queue is j = max(0, C - k). The solution of this system is 
obtained by a direct application of classical results. Thus, 

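$$p(i,0) = \rho^{i}\, p(0,0), \tag{1}$$

$$p(0,j) = \frac{\Pi(j)}{\lambda^{j}}\, p(0,0), \tag{2}$$

where ρ = λ/γ(C) and 

$$\Pi(j) = \begin{cases} \prod_{k=0}^{j-1} \gamma(C-k), & j > 0, \\ 1, & j = 0. \end{cases} \tag{3}$$

The probability p(0,0) is chosen so that the equilibrium state probabilities add up to 1: 

$$p(0,0)^{-1} = \frac{1}{1-\rho} + \sum_{j=1}^{C} \frac{\Pi(j)}{\lambda^{j}}. \tag{4}$$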






From the above expressions it is possible to compute the probabilities for all states 
and all operational characteristics of the modelled element. In particular, the mean 
queue length would be: 

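$$N = \sum_{c=0}^{C} c\, p(0,C-c) + \sum_{c=C+1}^{\infty} c\, p(c-C,0) = N_C + N_\infty, \tag{5}$$

where N_C is a finite sum and 

$$N_\infty = p(0,0)\, \frac{\rho}{1-\rho} \left( \frac{1}{1-\rho} + C \right). \tag{6}$$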
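The expressions above can be evaluated numerically once the token rates are known. The following is a minimal sketch, assuming the classical birth-and-death reading of the semaphore queue given in this section; the function name `semaphore_queue` and the convention that `gamma[k]` holds γ(k) for k = 1, ..., C (index 0 unused) are illustrative assumptions, not part of the original model description.

```python
from math import prod


def semaphore_queue(lam, gamma, C):
    """Steady-state quantities of a semaphore queue (sketch).

    lam   : Poisson arrival rate of customers (lambda)
    gamma : sequence with gamma[k] = token return rate when k tokens are
            outstanding, for k = 1..C (gamma[0] is ignored)
    C     : number of tokens
    Returns (p00, p_tokens, N) where p00 = p(0,0), p_tokens[j] = p(0,j)
    and N is the mean number of customers.
    """
    rho = lam / gamma[C]
    if rho >= 1.0:
        raise ValueError("unstable queue: lambda must be below gamma(C)")

    def big_pi(j):
        # Product gamma(C) * gamma(C-1) * ... * gamma(C-j+1); empty product is 1.
        return prod(gamma[C - k] for k in range(j))

    # Normalisation: geometric part plus the finite token part.
    inv_p00 = 1.0 / (1.0 - rho) + sum(big_pi(j) / lam**j for j in range(1, C + 1))
    p00 = 1.0 / inv_p00
    p_tokens = {j: big_pi(j) / lam**j * p00 for j in range(C + 1)}

    # Finite part: c customers inside, C - c tokens left in e(S).
    n_finite = sum(c * p_tokens[C - c] for c in range(C + 1))
    # Infinite part: closed form of the geometric tail.
    n_infinite = p00 * rho / (1.0 - rho) * (1.0 / (1.0 - rho) + C)
    return p00, p_tokens, n_finite + n_infinite
```

With a single token and a constant rate the sketch reduces to an M/M/1 queue, and with two tokens and rates μ, 2μ to an M/M/2 queue, which gives a quick sanity check of the formulas.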



However, all these expressions have been obtained assuming that γ(k) is known. 
This can be approximately obtained by studying the closed queueing network of the 
producer and consumer as shown in figure 6. The analysis of this queueing network 
can be easily carried out since it has been assumed that this network is of the BCMP 
type. Therefore, the throughput of this network can be computed for different values 
of k, the number of customers, where k= 1,2, . . ., C. 
Fig. 6. Producer-consumer complement subnetwork 

Finally, this throughput is set equal to the arrival rate γ(k) of tokens at the token 
queue e(S). The solution existence condition is λ < γ(k), where γ(k) is the maximum 
throughput and ρ < 1 [16]. 
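The throughput of the short-circuited producer-consumer network can be obtained for every population k with single-class Mean Value Analysis. The sketch below assumes that both stations are single-server queueing stations characterised only by their service demands; the function name `closed_network_throughput` and the example demands are hypothetical.

```python
def closed_network_throughput(demands, C):
    """Exact single-class MVA for a closed network of queueing stations.

    demands : list of service demands, one per station
    C       : maximum population (number of tokens)
    Returns [X(1), ..., X(C)], the throughput for each population k,
    which plays the role of gamma(k) in the semaphore queue.
    """
    queue = [0.0] * len(demands)  # mean queue lengths with k - 1 customers
    throughput = []
    for k in range(1, C + 1):
        # Arrival theorem: an arriving customer sees the network with k - 1 customers.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        x = k / sum(residence)
        queue = [x * r for r in residence]
        throughput.append(x)
    return throughput
```

For instance, `closed_network_throughput([1.0, 1.0], 3)` gives the throughput of a balanced producer-consumer pair for one, two and three circulating tokens.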

3 Pools 

The pool is a data storage that can be written to and/or read from by any task, and at 
any time. It is an effective means of sharing data among a number of tasks, not merely 
a select pair. Senders write information into the store and receivers read such stored 
information as a unit. Thus, writer and reader tasks need have no knowledge of each 
other. Reading has no effect on the stored data, whereas writing is a destructive 
operation. Moreover, pool usage is a random affair; it is impossible to predict in 
advance when tasks will access the data. 

There are several variants of the readers-writers problem, but the basic structure is 
the same. Tasks are of two types: reader tasks and writer tasks. All tasks share a 
common variable or data object. Reader tasks never modify the object, while writer 
tasks do modify it. Thus writer tasks must mutually exclude all other reader and writer 
tasks, but multiple reader tasks can access the shared data simultaneously [13]. 



Critical 

Sections 



Fig. 7. The readers-writers synchronisation problem modelled with an untimed Petri Net 

In consequence, there are two different classes of servers for two different 
customer classes. In this model either only one writer task can store information into 
the pool or up to C reader tasks can be reading it. Customers arrive at the semaphore 
queues in Poisson fashion at rates λ_w and λ_r, where λ_w is the arrival rate for writers 
and λ_r the arrival rate for readers. The service rate for writers is μ_w, and kμ_r is the 
service rate for the k-th reader at the input queue. Due to the limited simultaneous 
service for the readers at the pool, their service rate is proportional to their number, 
i.e. from μ_r in the case of one reader to Cμ_r in the case of C readers. Thus, there is 
only one server for writer tasks but there are up to C servers for reader tasks available 
in the pool. Therefore, if the maximum number of accepted readers being served in the 
pool is set equal to a limited number of tokens, the associated rate diagram of the 
extended states in the pool inner system is the classical Continuous Time Markov Chain 
(CTMC) of figure 8. 




Fig. 8. Continuous Time Markov Chain representing the flow rate diagram of the inner pool 



This CTMC is isomorphic to the corresponding reachability graph of the 
writing/reading critical sections in the Petri Net (PN) of figure 7, once the net is 
transformed into its Stochastic Petri Net (SPN), including timed transitions for arrivals 
and departures and the bounds expressed above. 

Under these assumptions, this simple finite CTMC is easily computed C times, for all 
the possible numbers of simultaneous readers n = 1, 2, ..., C. The initial state 
probability P_n(0,C,0) in every CTMC is equal to: 



i^„(0,C,0) 



1 + 






- P- 



( 7 ) 



where the nominal utilisations of the server are 



a 



w 




and 







(8) 



( 9 ) 



These yield the stability conditions that each nominal utilisation be less than 1. 
Consequently, the steady-state probabilities of every CTMC are obtained from the 
initial state and the global balance equations: 

P„ (0,0,1) = f2„E„(0,C,0), ( 10 ) 



Cp ( 11 ) 

P„(k,C-k,0) = ^P„(0,C,0). 
k\ 

Then, C different probability sets of the tokens are known. Each probability set 
contains n+2 states: P_n(0,C,0) is the marginal probability of an empty pool, P_n(0,0,1) is 
the marginal probability of the pool being accessed by the exclusive writer, and 
finally there are the n states where each P_n(k,C - k,0) is the probability of the pool 
being accessed by k simultaneous readers. 
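As an illustration of how one such probability set can be computed, the sketch below solves a chain with n+2 states numerically. It assumes, since figure 8 is not reproduced here, that the empty state moves to the writer state at rate λ_w and back at rate μ_w, and that k readers become k+1 at rate λ_r and k-1 at rate kμ_r; the function name `inner_pool_probabilities` and the state labels are hypothetical.

```python
from math import factorial


def inner_pool_probabilities(n, lam_w, mu_w, lam_r, mu_r):
    """Steady-state probabilities of the assumed inner-pool chain.

    States: "empty", "writer", and k = 1..n simultaneous readers.
    Because the assumed chain is tree shaped, local balance holds and the
    unnormalised weights can be written down directly.
    """
    weights = {"empty": 1.0, "writer": lam_w / mu_w}
    for k in range(1, n + 1):
        # k readers: (lam_r / mu_r)^k / k!, as in an M/M/n-type birth-death chain.
        weights[k] = (lam_r / mu_r) ** k / factorial(k)
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}
```

Calling the function for n = 1, ..., C yields the C probability sets, each with n+2 entries.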
However, the number of reader and writer tasks is not bounded, and their 
arrivals are independent processes. When the pool is fully occupied, other writer and 
reader tasks have to wait under a first-come first-served (FCFS) scheduling discipline, 
but there is no knowledge about the customer order in the waiting queue. On the other 
hand, while the writers are mutually exclusive, the readers may share the pool. We 
therefore propose the use of two different semaphore queues: a classical 
semaphore queue for the writer tasks and a new multi-server semaphore queue for the 
reader tasks. 

In consequence, the writers queue is a semaphore queue W constituted by its input 
queue f(W) and its token queue e(W), where there is only one token available. An 
arriving customer, a writer task, joins the input queue if it finds other customers 
waiting in this queue. If it finds the queue empty, then it requests the unique token 
from the token queue. If the token is available in queue e(W), the customer takes the 
token and then it moves into its critical section. However, if the token queue is empty, 
the customer waits in the input queue f(W) until the token becomes available. 

A writer task that enters the inner pool receives the required service. However, 
the service rate obtained is less than the nominal rate due to the readers' exclusive 
utilisation. After service completion the token joins the token queue and the customer 
departs from the semaphore queue. Since a customer cannot enter the inner pool 
without the token, the total number of writers in its critical section may not exceed 1. 
If there is the token in e(W), then that means that there are no customers in the input 
queue. On the other hand, if there are customers in the input queue, then e(W) is 
empty. 

The readers queue is a multi-server semaphore queue R constituted by its input 
queue f(R) and its token queue e(R), where there are C tokens available. An arriving 
customer, a reader task, joins the input queue if it finds other customers waiting in this 
queue. If it finds the queue empty, then it requests a token from the token queue. If 
there is at least one token available in queue e(R), the customer takes the token and then 
it moves into its critical section. However, if the token queue is empty, the customer 
waits in the input queue f(R) until a token becomes available. 

A reader task that enters the inner pool receives the required service. Nevertheless, 
the service rate obtained depends on the number of readers inside the pool and it is 
less than the nominal service rate due to the writers' exclusive utilisation. After service 
completion the token joins the token queue and the customer departs from the multi- 
server semaphore queue. Since a customer cannot enter the inner pool without a 
token, the total number of readers in its critical section may not exceed C, the 
maximum number of tokens. If there are tokens in e(R), then that means that there are 
no customers in the input queue. On the other hand, if there are customers in the input 
queue, then e(R) is empty. 

Writer and reader tasks cannot be served concurrently; therefore each task class 
delays the service of the other. In fact, since they are mutually exclusive, their 
critical sections are never executed in parallel. 

Therefore, writer and reader semaphore queues are sharing a mutually exclusive 
server with class dependent service rate (see figure 9). 
Fig. 9. Writer and reader semaphore queues representing a pool. Each semaphore queue has a 
different and mutually exclusive service (critical section). Writers have only one server while 
readers have up to C simultaneous servers 



Although arrivals to f(W) and f(R) are independent, the arrivals of tokens at queues 
e(W) and e(R) are described by two different state-dependent rates, ω for the writer 
token and ε(k) for the reader tokens. The semaphore queues are solved by 
decomposition and aggregation as explained above in section 2, taking into 
account that W has only one extended state, while R has C different extended states. 
Therefore, these queueing systems depict the semaphore operations described above. 
The arrival process at queue f(W) is assumed to be Poisson. It is also 
assumed that the interarrival times of tokens at queue e(W) are exponentially 
distributed with rate ω. The state of the system in equilibrium can be described by the 
tuple (i, j), where i is the number of customers in queue f(W) and j is the number of 
tokens in queue e(W). The rate diagram associated with this system is shown in figure 
10. This semaphore queue is identical to the one described in section 2. 




Fig. 10. Flow rate diagram of the writer semaphore queue 



The arrival process at queue f(R) is assumed to be Poisson and there are 
C tokens. It is also assumed that the interarrival times of tokens at queue e(R) are 
exponentially distributed with rate ε(k). The state of the system in equilibrium can 
be described by the tuple (i, j), where i is the number of customers in queue f(R) and j 
is the number of tokens in queue e(R). The rate diagram associated with this new 
variant of a semaphore queue is shown in figure 11. 
Fig. 11. Flow rate diagram of the readers multi-server semaphore queue 



The solution existence conditions are expressed as λ_w < ω and λ_r < ε(C), the 
corresponding maximum throughputs of W and R, respectively. Under these 
assumptions, the following steady-state probabilities can be computed from the 
global balance equations [9]: 

$$p_w(i,0) = \rho_w^{i}\, p_w(0,0), \tag{12}$$



Priifi) 



{CpJ 

C 



pM0) = p’pX0,0) 



Vi, 0 < i < c , 



(13) 



C^p 



p^m = -^p^{o,o) = p;p,{o,o) yi>c. 



c 



(14) 



$$p_w(0,1) = \frac{\omega}{\lambda_w}\, p_w(0,0), \tag{15}$$






where ρ_w = λ_w/ω, ρ_r = λ_r/ε(C) and 

$$\Pi(j) = \begin{cases} \prod_{k=0}^{j-1} \varepsilon(C-k), & j > 0, \\ 1, & j = 0. \end{cases}$$

Consequently, both p_s(0,0) states are given by the normalisation as follows: 



(16) 



(17) 



pJ0,0)-‘=-^ + |, 

1-Pw ^ 



(18) 



P.(0,0)- 



i-Py h h 



(19) 



Using these normalisation equations it is possible to compute the probabilities for 
every state and all of the operational characteristics of the pool. In particular, the 
mean queue lengths per class would be the sum of the finite and infinite part for each 
type of task: 
$$N_s = \sum_{c=0}^{C} c\, p_s(0,C-c) + \sum_{c=C+1}^{\infty} c\, p_s(c-C,0) = N_{sC} + N_{s\infty} \quad \forall s \in \{w, r\}. \tag{20}$$



Both finite parts are plain finite sums; however, owing to the different modelling of the 
semaphore queues, the mean number of writer tasks waiting in its input queue is given by: 



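$$N_{w\infty} = \sum_{c=2}^{\infty} c\, p_w(0,0)\, \rho_w^{c-1} = p_w(0,0)\, \frac{\rho_w}{1-\rho_w} \left( \frac{1}{1-\rho_w} + 1 \right), \tag{21}$$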



whereas the mean number of reader tasks waiting outside the pool is given by: 



= X^Cp,(0,0) 



C - C-2C 2C 



C^P. 



^ c=C+l ^ 



r ^ 

c 



( 22 ) 



and then 



N, =P,(0,0) 



2C Pr 



i-Pr 



i-Pr 



- + 2C 



+ A 



( 23 ) 



where 



A = 



£(C + y)^^^ = £(C + 7)p/ VOO 

j=i ^ j=i 



0 



c = o 



( 24 ) 



These formulas show the natural but important property of the semaphore queues 
that the mean number of customers in the system is identical to that of an M/M/1 queue 
when C = 0, i.e. there is no buffer. This property holds directly in the computation of the 
mean queue length per class because there is no extended state and then 
p_s(0,0) = 1 - ρ_s [8]. Another interesting property of the new variant of the semaphore 
queue appears when there is only one reader token available, i.e. C = 1. 
As with the M/M/m queue relative to the M/M/1 queue, the mean queue length for the 
reader tasks is then computed identically to the mean queue length for the writer tasks [3]. 

Nevertheless, these calculations are obtained assuming that the service rates of the 
semaphore queues, ω and the vector ε(k), are known. The use of two different 
semaphore queues guarantees the blocking of a customer before its service and the 
finiteness of the resource. On the other hand, to guarantee the mutual exclusion 
between writer and reader tasks in the inner system of the pool, taking 
the unique writer token at queue e(W) is equivalent to taking the C reader 
tokens at queue e(R) at the same instant. 

Thus, the arrival rates of the outstanding tokens are given by the solution of C 
simple equations and C simple products. When there is a writer task in the inner 
system of the pool, there are no tokens available, i.e. neither writer tokens nor reader 
tokens. This event occurs with probability P_n(0,0,1) for n = 1,...,C possible readers 
(see formulas 7 and 10). When there are k reader tasks in the inner system of the pool, 
there are only C - k reader tokens available with probability P_n(k,C - k,0) for 
n = 1,...,C possible readers (see formula 11). However, it is unnecessary to compute 
all these values for the C CTMCs because the average writer utilisation of the pool 
server is This last feature avoids the state space explosion for the 

computation of all possible reader combinations. Then, ω and ε(k) are given by: 



co = 



£(1-P„ (0,0,1)) 



{C-n + \) 
c 






(25) 



f(^) = [fe(i-nJK. 



(26) 



4 Numerical Examples 

The different examples have been simulated so as to obtain a confidence 
interval of at least 5% for the considered mean queue lengths. All examples take 
the arrival rate for writers equal to the arrival rate for readers, and the service time 
for writers and readers as the unit. In this way, it is easier to compare the results 
of the semaphore queue of the writers with those of the multi-server semaphore queue 
of the readers. In these numerical examples, and others not included in the text, the 
difference between the simulation and the analytical approximation values for the 
performance measurements ranged from less than 1% to 3%. Only when the pool 
utilisation was more than 80% did this difference reach approximately 10%. As in the 
classical semaphore queue approximation, the major differences appeared in the 
multi-server semaphore queue when the pool utilisation is close to server saturation. 
An interesting fact is that the differences between the proposed approximation and the 
simulation tend to be balanced, i.e. a positive difference in the mean queue length of 
writers corresponds to a negative difference in readers and vice versa, with 
similar magnitudes. 

Table 1. Mean queue length examples computed with simulation and the semaphore queue 
approximations for writer and reader tasks in the pool modelling (//„, ’ = C/tf ’) 



Fig. 12. Mean queue length of the writer tasks (left) and reader tasks (right) as a function of the 
interarrival time, which is the same for both task classes. The number of readers at the pool is bounded to 2 








Fig. 13. Mean queue length of the writer tasks (left) and reader tasks (right) as a function of the 
interarrival time, which is the same for both task classes. The number of readers at the pool is bounded to 5 
Fig. 14. Mean queue length of the writer and reader tasks as a function of the maximum number 
of simultaneous readers at pool. The interarrival time is identical for both task classes 
5 Conclusions and Open Problems 

This paper has shown the possibility of building approximate analytic models of 
typical soft real-time design components such as the basic pool. The advantage of 
such performance models is the saving in human debugging and 
computation time compared with the corresponding simulation models. 

By the use of the decomposition-aggregation method, the proposed approximate 
analytical model avoids the need for simulation while keeping a reasonable degree of 
accuracy. In soft real-time systems, several tasks without any priority scheduling that 
attempt to access a common resource may be excluded while one task has 
possession. This policy can be achieved with primitives such as locks. This effect is 
generally given as an example of what queueing networks cannot model well. However, 
a new multi-server semaphore queue was modelled to solve this apparent problem. 

The basic pool and other elementary components can be included by software 
developers in performance libraries to complement system design methodologies such as 
Mascot. These performance libraries would also include simulated basic models, in order 
to implement a mixed performance model of the system with the corresponding 
performance tool. Therefore, queueing performance models of soft real-time 
components could complement the whole design of a large soft real-time system. 

Some clues have also been given on how to generalise the approximate analytical 
modelling and performance evaluation of new components with buffering or shared 
variables from the study of channels and pools. The successive application of the 
decomposition technique, combined with some iterative procedure, will increase the 
number of modelled elements available as building blocks for developing a new 
performance modelling tool for soft real-time systems. 

This work should be extended in order to combine the application of decomposition 
techniques with other approximations for new individual components, e.g. priority 
pools, and to integrate them so as to widen the alternatives available for whole-system 
design. 
Abstract. In this paper we present a graphical software tool, Xaba, that 
implements exact, approximate, and asymptotic solutions for multi-class 
closed queueing networks with product form solution. It was specifically 
designed for the evaluation of complex systems with large numbers of 
customers of different types, ranging from small populations to very 
large ones (e.g., tens of stations and hundreds of customers). The tool was 
developed for Unix systems under the X11 environment and is available 
for a variety of platforms. Its functionalities are illustrated via the case 
study of the evaluation of a corporate intranet. 



1 Introduction 

Analytic modeling, using queueing networks or Markov analysis, is a successful 
approach to performance evaluation of computer systems, as it yields good insight 
and understanding of the system behavior. Unlike simulation, which can 
reproduce complex system behavior without any substantial difference between 
the model and the real system, often at prohibitive computational cost, 
analytic models impose an abstract view of the system in order to meet the 
assumptions for their application, but their solution is significantly cheaper from 
a computational point of view. However, due to the large scale deployment of 
interconnection networks with computational servers offering wider and wider 
varieties of services to possibly several thousands of users, real systems have 
become increasingly complex. Nowadays computing systems are configured at 
campus level, whether corporate or academic, and comprise local to wide area 
networks, private and public ones such as the Internet, tens of servers, several 
hundreds of clients, dedicated machines such as firewalls or application servers, 
as well as general purpose multitasking systems. 

Such a complexity is reflected in performance models with many server sta- 
tions and workload types, i.e., large populations of various customer classes. 
In this case, even analytic modeling becomes computationally expensive. Exact 
solutions of large queueing network models quickly become infeasible from 
a computational point of view even if the population comprises few customer 
classes [4,16,15,18]. Approximate solutions [2,5,7,11,12,17,21,23] are therefore 
the only viable choice, as their computational requirements are limited and depend 
upon the accuracy required and the population mix considered. With approximate 
solutions, the time to solve complex queueing networks reduces to the 
order of minutes compared to several hours with exact solutions or simulations. 
Although, in general, the error in the solution is not known nor can it be bounded, 
they are an effective alternative to exact solution methods. An alternative to 
both exact and approximate solutions is represented by asymptotic bound analysis 
[10,11,12,13,14,19]. In this case, the lack of accuracy must be traded off for 
the almost negligible computation cost. Asymptotic bound techniques, recently 
developed also for multi-class systems [1], are an adequate solution, especially 
for initial evaluation studies during system design. 

In this paper we present a tool, Xaba (Xwindow Approximate and Bound 
Analyzer), for the solution of multi-class closed queueing networks that inte- 
grates exact, approximate, and asymptotic solution techniques. It also provides 
bottleneck analysis for the identification of single and multiple bottlenecks. The 
tool has an open architecture that allows the users to extend the solver set with 
new techniques. It was implemented for Unix systems under the Xwindow X11R6 
development environment for various platforms, namely Sun, HP, IBM. Porting 
to Linux is currently being considered. Queueing network parameters are input 
via a set of dialog boxes. The same network may be solved using different solving 
techniques. One of the advantages of having a tool that incorporates a variety 
of approximate solution techniques is that they can all be executed on the same 
model with no editing of the model itself and results can be easily compared. 
Furthermore, sophisticated “what if” analysis is possible thanks to the limited 
computation time required by approximate and asymptotic techniques. Results in 
graphical and tabular format are available for all the computed performance 
indices. 

This paper is organized as follows. Section 2 briefly recalls the solution techniques 
implemented in the tool and the evaluation methodology combining the 
various solutions. Section 3 describes the tool architecture and features. Section 4 
offers an example of application of the tool to the performance evaluation 
and capacity planning study of a corporate intranet. Section 5 summarizes our 
contribution and concludes the paper. 



2 Solution Methods 

In this section we briefly review the type of networks that can be solved using 
Xaba and the solvers implemented. Solving a queueing network means finding 
the limiting probabilities for each state of the network, where, in case of multi- 
class workload, a state is described by the vector, one for each station, whose 
components are the number of customers of each class at that station. 
We consider only product form networks, i.e., networks whose steady state 
joint probability distribution can be expressed as the product of marginal prob- 
abilities for each station [^. Product form networks are characterized by the 
following properties: Mixed (open and closed) multiple classes of customers; sta- 
tion scheduling disciplines among First Come First Served (FCFS), Last Come 
First Served-Preemptive Resume (LCFS-PR), Processor Sharing (PS), and In- 
finite Server (IS); fixed rate service time distributions with rational Laplace 
transform or, in case the discipline is FCFS, exclusively exponential with the 
same mean for all classes; state dependent service times depending only upon 
the queue length for a given customer class, unless they are exponential, in which 
case they may depend upon the total queue length at the station only. 

2.1 Exact Solution Methods 

Product form networks have received considerable attention from the research 
community as they are amenable to efficient solutions. Several algorithms, all 
with comparable computational complexity, have been developed since the mid 
seventies for their solution (see e.g., csnHi). We recall here the Mean Value 
Analysis (MVA) |ini, as it is the exact solver implemented in Xaba. 

MVA is characterized by recursive computation of the performance indices 
for a closed queueing network. In the general multi-class case for a network with 
R classes of customers and K stations, the average response time for a customer 
of class r at station i in a system with population vector $\mathbf{N} = N\boldsymbol{\beta}$, 
where $N = \sum_{r} N_r$ is the total number of customers in the system, $N_r$ is the total 
number of customers of class r in the system, and 
$\boldsymbol{\beta} = (N_1/N, N_2/N, \ldots, N_R/N)$ is the population mix vector, is 
given by 

$$T_{ir}(\mathbf{N}) = \begin{cases} D_{ir} & \text{if } i \text{ is IS,} \\ D_{ir}\,[1 + Q_i(\mathbf{N} - \mathbf{1}_r)] & \text{otherwise,} \end{cases}$$

where $D_{ir}$ is the total service demand at station i from customers of class r, 
$Q_i(\mathbf{N})$ is the queue length at station i when the population vector is $\mathbf{N}$, 
and $\mathbf{1}_r$ is a vector whose r-th component is 1 and the rest are null. The 
other indices are derived using Little’s result. The algorithm starts off with a 
population of N = 0 and proceeds by adding one customer at a time for each 
class, so as to maintain the relative proportions, and recomputing the indices 
at each step. Therefore, the time complexity is $O(RK \prod_{r=1}^{R} (N_r + 1))$ and the 
space complexity is $O(K \prod_{r \neq r_{max}} (N_r + 1))$, where $r_{max}$ is the index of the class 
with the largest population. Because these values grow with the product of the class 
populations, and hence exponentially with the number of classes, the solution of large 
multi-class queueing networks quickly becomes computationally infeasible or too expensive. 
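The recursion can be sketched in a few lines. This is a minimal illustration of textbook multi-class MVA and not the implementation used in Xaba; the function name `mva_multiclass`, the argument layout (`D[i][r]` for the demand of class r at station i) and the `is_station` flags are assumptions made for the example.

```python
from itertools import product


def mva_multiclass(D, N, is_station=None):
    """Exact multi-class MVA for a closed product form network (sketch).

    D          : D[i][r] = service demand of class r at station i
    N          : tuple with the population of each class
    is_station : optional list of booleans, True for Infinite Server stations
    Returns (X, Q): per-class throughputs and per-station total queue
    lengths at the full population N.
    """
    K, R = len(D), len(N)
    is_station = is_station or [False] * K
    Q = {tuple([0] * R): [0.0] * K}  # queue lengths for every visited population
    X = [0.0] * R
    # Visit population vectors by increasing total so that n - 1_r is already known.
    for n in sorted(product(*(range(Nr + 1) for Nr in N)), key=sum):
        if sum(n) == 0:
            continue
        T = [[0.0] * R for _ in range(K)]
        X = [0.0] * R
        for r in range(R):
            if n[r] == 0:
                continue
            prev = Q[tuple(n[s] - (s == r) for s in range(R))]
            for i in range(K):
                T[i][r] = D[i][r] if is_station[i] else D[i][r] * (1.0 + prev[i])
            X[r] = n[r] / sum(T[i][r] for i in range(K))  # Little's result
        Q[n] = [sum(X[r] * T[i][r] for r in range(R)) for i in range(K)]
    return X, Q[tuple(N)]
```

The dictionary `Q` holds one entry per population vector, which makes the product-of-populations growth of time and memory directly visible.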

2.2 Approximate Solution Methods 

Approximate solution methods have been developed to overcome the computational 
complexity of MVA. A popular class of approximate methods are the 
local ones, which approximate the queue lengths about a given population vector 
in order to transform the MVA recursion into a closed system of fixed-point 
equations. Several local approximation techniques have been developed over the 
years [2,5,7,11,12,17,23,21]. The general scheme of approximate solutions is 
the following. An initial estimate of the queue lengths for the current state is 
provided, based on which the queue lengths at previous states (i.e., states with 
fewer customers) are computed. The newly computed queue lengths are then 
used to recompute the queue lengths in the current state. The procedure is iterated 
until the requested accuracy is obtained. The various algorithms differ in 
the way the queue lengths at previous states are computed, and in the number of 
previous states considered. Unlike MVA, which considers all previous states for 
a given one, i.e., for any population N the computation starts from 0, approximate 
methods consider only N - 1_r or possibly further previous states 
[5]. This limits the computational complexity of these algorithms to O(KR^x) 
for x = 2 or 3, depending upon how many previous states are considered. Because 
the solution is approximate, potentially large errors may affect the queue 
lengths, depending upon the initial estimate and the population distribution. 
In general, more accurate results are obtained at additional computational cost 
as more previous states are considered. 
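The general scheme can be illustrated with the simplest local approximation, in the commonly published Schweitzer-Bard form: the queue of class s seen at population N - 1_r is taken as Q_is(N) for s ≠ r and as (N_r - 1)/N_r · Q_ir(N) for s = r. The sketch below is an assumption-laden illustration, not Xaba's code; the function name, the even initial spread of customers and the stopping tolerance are choices made for the example, and only queueing (non-IS) stations are handled.

```python
def schweitzer_bard(D, N, tol=1e-10, max_iter=10000):
    """Schweitzer-Bard fixed-point approximation of multi-class MVA (sketch).

    D : D[i][r] = service demand of class r at station i (queueing stations)
    N : tuple with the population of each class
    Returns (X, Q): per-class throughputs and per-station, per-class queue lengths.
    """
    K, R = len(D), len(N)
    # Initial estimate: each class spread evenly over the stations.
    Q = [[N[r] / K for r in range(R)] for _ in range(K)]
    X = [0.0] * R
    for _ in range(max_iter):
        Q_new = [[0.0] * R for _ in range(K)]
        for r in range(R):
            if N[r] == 0:
                continue
            T = []
            for i in range(K):
                # Queue seen by an arriving class-r customer: the current queue
                # with one class-r customer removed proportionally.
                seen = sum(Q[i]) - Q[i][r] / N[r]
                T.append(D[i][r] * (1.0 + seen))
            X[r] = N[r] / sum(T)
            for i in range(K):
                Q_new[i][r] = X[r] * T[i]
        diff = max(abs(Q_new[i][r] - Q[i][r]) for i in range(K) for r in range(R))
        Q = Q_new
        if diff < tol:
            break
    return X, Q
```

Each iteration touches every station and class once, so the cost no longer depends on the population sizes, at the price of an error that is not bounded in general.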

Xaba implements the Schweitzer-Bard (SB) [2,17], the Linearizer (LIN) [5], 
and the Queue Shift Approximation (QSA) [20] algorithms, whose computational 
complexities are of the O(KR^x) form given above. Table 1 summarizes the 
approximations used for each of the implemented algorithms. 



Table 1. Queue length approximation and complexity for the algorithms in Xaba. 
The function δ_rs is the selector function equal to 1 if r = s and 0 otherwise. 



Alg. 


Queue Length Approximation 


SB 


Qir{N IJ ~ Qir{N) 


LIN 


QirCiy-it-iQ QiAN-lJ QiAN-l^) Qir(N) 

Nr--Sr^t Nr^ ~ iVr 


QSA 


mN) - Qi{N - IJl - [Q,(A - 1,) - Qi{N - 1, - 1,)] ~ 




[QiiK - IJ - Qi{N - 1, - 1 J] - [Qi{N - 1, - - Q,(A - 1, - 1, - IJ] 



2.3 Asymptotic Bounds Methods 

When the population is very large, bounds on the performance indices are of- 
ten sufficient for the purposes of an evaluation study. Under such conditions, 
identifying the bottleneck resources is also of interest. 

Several bound techniques have been derived that provide lower or upper bounds 
on the performance indices for large populations [1,10,11,12,13,14,19]. 
The various techniques differ for the tightness of the computed bounds and the 
region of validity in the population space where they apply. Asymptotic Bound 
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Analysis based on Operational Analysis j9], Balanced Job Bounds [12], Com- 
posite Bounds [T^, and Asymptotic Expansion are the bound techniques 
implemented in Xaba. 
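
To give a flavor of the simplest of these techniques, the sketch below computes the classical operational-analysis upper bound on the throughput of one customer class: the class cannot complete more than $N_r / (D_r + Z_r)$ jobs per time unit even without queueing, and cannot push any station beyond full utilization, so its throughput is also at most $1 / \max_i D_{ir}$. The function names and the optional think time `think` are assumptions of this example, not part of the tool's interface.

```python
def throughput_upper_bound(demands, n, think=0.0):
    """Asymptotic upper bound on the throughput of one class.

    demands[i]: service demand of the class at queueing station i.
    n: number of customers of the class; think: delay (think) time.
    """
    light_load = n / (sum(demands) + think)  # no queueing at all
    heavy_load = 1.0 / max(demands)          # most loaded station saturated
    return min(light_load, heavy_load)


def saturation_population(demands, think=0.0):
    """Population at which the two asymptotes cross."""
    return (sum(demands) + think) / max(demands)
```

With demands of 0.2, 0.5 and 0.1 s, the bound is 1.25 jobs/s for a single customer and levels off at 2 jobs/s once the population exceeds the crossover value of 1.6.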

Bottleneck analysis completes the evaluation study of the system under stress
conditions. Single bottlenecks and multiple bottlenecks, i.e., sets of resources that
are jointly loaded to saturation by more than one class, are identified as a function
of the population mix and the loadings. Furthermore, the state space of the population
mixes is divided into a set of regions, each of which leads to the saturation of one
resource or a set of resources when the population grows to infinity. Thus, for a given
set of service demands, it is possible to forecast the bottleneck resource as a function
of the population mix.



2.4 Evaluation Methodology 

A three-step methodology is suggested for evaluating complex systems with a large
number of customer classes and a large population size with Xaba, so that each of
the available solution techniques is applied where it is most suitable. First, exact
solutions may be computed efficiently for small population sizes,
where the accuracy of the solution is usually more important. Second, for intermediate
to large population sizes, approximate solutions may be computed, as estimates
may be sufficient even though they are less accurate. Unfortunately,
there is no way to guarantee the quality of such estimates from a theoretical point
of view, nor to bound the error. However, since the available bound techniques
define the feasible region where the actual performance indices must lie,
they can be used to verify at least the validity of the solution obtained during
the second step with an approximate algorithm. Such a check is
useful when errors are particularly large. Third, at very high loads, it is often
sufficient to capture the trend of the performance indices and define the feasible
region for the solutions. For this reason, asymptotic bounds can be composed, using
the tightest one in each segment of the population range.

3 Xaba 

In this section we give an overview of the tool architecture and functionalities.
Examples of its usage are provided in the next section, as part of
the case study considered.



3.1 The Architecture 

The goal of Xaba is to provide an easy-to-use set of queueing network solvers
with a user-friendly interface for the input (i.e., model definition) and the output
(i.e., model solution reporting) phases. The tool has an open architecture, so
that new solution methods can be easily incorporated in the algorithmic kernel.
An algorithm proposed in [21] was recently added by one of the authors.
The tool comprises five independent modules, as illustrated in Figure 1, namely
an INPUT module, for the acquisition of the queueing network parameters; a
MODEL module, for queueing network internal data manipulation; a SOLVER
module, representing the algorithmic kernel that implements the various solution
methods; an OUTPUT module, for results visualization; and an INTERFACE
module, for GUI management.

The tool was developed for Unix systems under the development environment 
Xwindow X11R6, using the following libraries: Xlib, Xt intrinsics, OSF/Motif, 
Xbae Widget, Athena Tools Plotter Widget, and LibXG Widget. The current 
release 3.2 is available for SunOS 4.1.3, HP-UX 9.05, IBM AIX 4.1.1 and is 
currently being ported to Linux. 




Fig. 1. A schematic view of Xaba architecture. 



3.2 Functionalities 

All functionalities are accessible via pop-up menus and icon tool bars. They can
be grouped into three main categories: model editing, model solution, and results
visualization. On-line help is also available. An example of the user interface is
shown in Figure 2.

Model editing comprises model creation, modification, saving to
disk, copying, and deletion. A model is constructed by specifying the number
of stations and the number of customer classes. For each station, the station
type (queue or delay) and either the service times and visit ratios or their product,
i.e., the service demands or loadings, must be specified. For each customer class,
the fraction of customers must be specified. Model parameters, whether related
to the stations or the population, can be edited during a session. In particular,
Xaba distinguishes between session models, i.e., those models that have been
loaded or just created during a session, and stored models. This allows the user
to duplicate and modify a session model without having to duplicate the file at
file system level, i.e., outside the application. Several models
can be present simultaneously during a session, their number
being limited only by the system memory configuration. However, only one model
at a time can be active, i.e., under evaluation. Removing a model from the active
set does not delete the file from the file system.
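
The model parameters listed above can be pictured with a small data structure. The following Python sketch is purely illustrative: the class and field names are hypothetical and do not reflect Xaba's internal MODEL module or its file format. It only shows how service demands follow from service times and visit ratios, and how class populations follow from the fractions of customers.

```python
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    kind: str            # "queue" or "delay"
    service_times: list  # mean service time per visit, one entry per class
    visit_ratios: list   # mean number of visits, one entry per class

    def demands(self):
        """Service demands (loadings): visit ratio times service time."""
        return [v * s for v, s in zip(self.visit_ratios, self.service_times)]


@dataclass
class Model:
    stations: list
    class_fractions: list  # fraction of customers per class, summing to 1

    def populations(self, total):
        """Customers per class for a given total population."""
        return [total * f for f in self.class_fractions]
```

For instance, a station visited 10 times per job with a service time of 0.02 s per visit has a service demand of 0.2 s for that class.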

Access to the solver kernel is obtained by first selecting the format of the
results, whether tabular or graphical, via distinct menus. Either menu allows
the user to choose the solution required, bottleneck analysis or performance indices
(i.e., exact or approximate solution). With bottleneck analysis, single and multiple
bottlenecks are identified, if there are any. In this case, for each station, the
population mixes that make that station become the network bottleneck are reported,
if any exist. All combinations of station pairs and triples are also considered
to identify multiple common bottlenecks. Tabular results are given. Graphical
representations are available only for up to three classes. With the performance
indices computation, throughput, utilization, response time, waiting time, and
queue length can be computed for a given maximum number of customers, using
any one of the available solvers, whether exact, approximate, or asymptotic.
Multiple indices cannot be selected simultaneously.

Fig. 2. Examples of dialog windows for model input and solver selection. The general
window is in the background; in the foreground is the solver and performance indices
selection window.

Results visualization is available in tabular and graphical format. Results are
computed, when possible, at system and at station level, globally and per class
in both cases. The granularity of such a computation, i.e., the number of intermediate
results that will appear in the table or be plotted in the corresponding
graph, is defined by the user. The finer the granularity, the smoother the graphs
but the more expensive the computation. All results can be saved to disk, in
ASCII format for tabular results and in PostScript format for graphs,
or printed as hard copies.

In the next section a performance evaluation study is presented that illustrates
the use of the tool and the three-step methodology it supports.



4 An Evaluation Study 

In this section we illustrate the use of Xaba on an example scenario consisting 
of a corporate intranet connected to the Internet via a firewall. We first describe 
the scenario from a qualitative point of view and then present the model and 
the results of the analysis with Xaba. 



4.1 The System 

The scenario we consider is depicted in Figure 3. A corporate intranet is connected
to the Internet via a firewall. The firewall architecture is based on two
routers, an external one and an internal one, that define a perimeter network
(De-Militarized Zone, DMZ) where bastion hosts offer a set of services to both
external and internal users [6]. The external Web server, the e-mail server for
mail exchange with the rest of the world, and the anonymous ftp server for
documentation download are accessible from outside the corporate network and from
the intranet. The internal Web server, the http proxy server, the internal e-mail
server for corporate mail, and the application and database server are accessible
to all internal users^1 and to users coming from the outside world that have
successfully authenticated themselves at the firewall. Services offered by other
servers, such as DNS or Telnet proxy, are not considered here as their impact
on performance is limited in our case. Both the internal and the external LANs
are 100 Mbps Ethernet networks.

Users are either external or internal. External users exhibit two types of behavior,
depending on the level of trust the corporate network associates with them. Users
indicated as “mobile” in the figure are admitted to the private network after a strong
authentication procedure, as they represent corporate employees working off site.
They also access the ftp server occasionally for public document download. Users
indicated as “trusted customers” in the figure represent corporate customers that, by
visiting dynamic pages with restricted access on the external web server, also indirectly
access the internal DB server. Internal users exhibit two types of behavior, depending
upon whether they remain within the corporate domain or access
the Internet. Internal users that remain within the intranet domain access the
internal Web server and the DB and application server, while those who exit the
intranet perimeter mostly do Web browsing. Such an activity generates requests
on the http proxy cache and, in case of a cache miss, on the external web server and
the Internet.

Fig. 3. The corporate network of the considered scenario.

^1 Access control policies that are usually in place in any intranet are implicitly taken
into consideration.

A qualitative summary of the behavior of internal and external users is given
in Table 2 as a function of the resources each user class visits. We distinguish
the behaviors caused by a cache hit or miss on the http proxy cache, as they
involve different sets of resources. The customer classes are therefore indicated
as “proxy” and “Internet”, respectively, while “intranet” denotes the internal
customers operating exclusively within the private network perimeter. External
user classes are indicated as “mobile” and “trusted.”



Table 2. User classes and system resources of the corporate scenario considered: who 
accesses what. 

4.2 The Model

The system described in the previous section is now modeled with a multi-class
closed queueing network, as depicted in Figure 4. Some assumptions were made
in order to be able to solve the model using Xaba solvers. However, they are
reasonable assumptions when a macro-model such as the one considered here
is analyzed. The two 100 Mbps Ethernet LANs are modeled as fixed rate
servers, as we assume a low collision rate guaranteed by the network bandwidth.
Because their load consists mainly of disk usage, with low CPU usage, both the DB
and Web servers are modeled as fixed rate servers. Finally, the Internet is modeled as
a pure delay station. The model comprises fifteen stations with five classes of
customers and will be analyzed for up to a thousand customers.




Fig. 4. The queueing network model of the corporate network considered. 



4.3 Results 

In this section we present the results of the “what if” analysis of the system con- 
sidered under various hypotheses. Two cases are analyzed. In the first case, the 
impact on performance of the population mix is investigated. Then, upgrading 
actions will be considered in order to improve response time for a specific class 
of customers. 

Figure 5 reports the output window the tool generates for the utilization of
the seven most utilized resources (i.e., external and internal web servers, http
proxy cache server, DB server, FTP server, internal and external e-mail servers)
with 1000 customers and $\beta = (.1, .1, .1, .35, .35)$. The results shown in the figure
were obtained by solving the network using Linearizer, which required less than
three minutes on a Sun Sparc10 with 64 MB of main memory running SunOS
4.01. Solving the same network with only 150 customers using MVA required
about nine hours on the same machine. The resource name, the class for which
the index was computed (all classes in our case), and the solution method used
(Linearizer, LIN in this case) are associated with each of the colors and line types
used in the graph. The values of individual points can be read in the “X” and
“Y” buttons below the graph when the “Request position” option is selected.
They can be read next to the cursor in the graph when the “Show position”
option is selected instead. Closer looks at user-defined regions are obtained with
the “Zoom” option.

Fig. 5. Individual resource utilization with up to 1000 customers, $\beta = (.1, .1, .1, .35, .35)$.

As the figure shows, the primary bottleneck is the external
web server, with the secondary bottleneck on the internal web server. The former
saturates around 200 customers, the latter around 400. However, the analysis
up to 1000 customers shows that under the given workload conditions, the http
proxy tends towards saturation as the population size increases beyond 700 customers.

The capacity planning study analyzes the impact on system performance
of changes in the population mix due to the projected success of the corporate
Internet site, which is expected to attract more and more external users. In order
to investigate the bottleneck switching with respect to the range of population
mixes, the network was solved for mixes ranging from $\beta = (.3, .3, .3, .05, .05)$
to $\beta = (.0333, .0333, .0333, .45, .45)$, i.e., from 90% to 10% internal users.
Since the tool also provides tabular results in ASCII format, composite graphs
can be obtained to summarize a collection of experiments. Figures 6 and 7 illustrate
the behavior of the utilization of the four most utilized resources (i.e., external
and internal web servers, http proxy and DB server) with 200 and 600 customers,
respectively, as a function of the total percentage of internal users. With
200 customers, the bottleneck is the DB server from 100% to about 70% internal
users ($\beta = (.2333, .2333, .2333, .15, .15)$), then it switches to the internal
web server from 70% to about 48% internal users ($\beta = (.16, .16, .16, .26, .26)$),
and finally switches to the external web server for all remaining mixes. With
larger populations, e.g., 600 customers as depicted in Figure 7, the absolute
utilization of all resources is higher. The overall behavior is similar although
not as evident. Unlike the previous case, more than one resource saturates
simultaneously. The http proxy server saturates as well for high percentages of
internal users.

Fig. 6. Bottleneck switching among the internal web server, external web server, DB
server for 200 customers as a function of the total percentage of internal users.

We now consider the potential benefits of upgrading actions on
the servers. Under the baseline conditions of Figure 5, i.e., with 1000 customers
and $\beta = (.1, .1, .1, .35, .35)$, the external web server is the bottleneck resource
that saturates around 200 customers. The response time under such conditions,
which cannot be shown for lack of space, quickly reaches values in the range
of 150 s. A response time of two and a half minutes is obviously unacceptable
if the corporation wishes to promote its Internet site. Therefore, the external web
server needs a substantial upgrade. Figure 8 illustrates the response time
for each of the external classes on the three servers in the DMZ when the external
web server has been replaced with a new one 1.6 times faster than the
old one. In this case, the maximum response time with 1000 customers is less
than 7 s.

Fig. 7. Bottleneck switching among the internal web server, external web server, DB
server and http proxy for 600 customers as a function of the total percentage of internal
users.
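
The population mixes quoted in this section all follow one pattern: the share of internal users is split evenly among the internal classes and the remainder evenly among the external ones. The short Python sketch below reproduces the quoted vectors under the assumption, inferred from the values in the text rather than stated explicitly, that the three internal classes (proxy, Internet, intranet) come first in $\beta$ and the two external classes (mobile, trusted) last. The function name is hypothetical.

```python
def population_mix(internal_fraction, n_internal=3, n_external=2):
    """Return beta for a given total fraction of internal users.

    Assumed class order: internal classes first, then external classes;
    each group shares its fraction evenly.
    """
    p = internal_fraction
    return ([p / n_internal] * n_internal
            + [(1.0 - p) / n_external] * n_external)
```

With this convention, `population_mix(0.3)` gives the baseline mix (.1, .1, .1, .35, .35), and `population_mix(0.7)` gives (.2333, .2333, .2333, .15, .15).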



5 Conclusions 

In this paper we have presented a graphical tool for the exact, approximate,
and asymptotic bound analysis of multi-class closed queueing networks with
product form solution. Thanks to the simple, user-friendly graphical interface,
detailed sensitivity analysis of complex queueing networks can be performed with
minimum effort. Furthermore, such analysis is supported by computationally
efficient solution methods, which allow even very complex networks to be solved
in a few minutes.

Exact solutions can be computed for small to medium population sizes, followed
by approximate solutions for larger populations. Asymptotic bounds can
then be applied both to validate the approximate solutions (by checking whether they
fall within the feasible region the bounds determine) and to capture the trend of the
performance indices as the population is increased further. The functionalities the
tool implements were demonstrated on a case study of a corporate network with
up to a thousand customers and fifteen stations.

Acknowledgments 

The authors are grateful to all the students who contributed to the development 
of the tool with their master projects. 







P. Cremonesi, E. Rosti, and G. Serazzi 





[Plot for Fig. 8: response time versus number of customers (0 to 1000); curves for Ext_Mail, Ext_Ftp and Ext_Web, each for the Mobile and Customer classes, computed with the SB solver.]



Fig. 8. Response time of the upgraded system with the baseline population mix. 



References 

1. G. Balbo, G. Serazzi, “Asymptotic analysis of multi-class closed queueing networks: 
multiple bottlenecks,” Performance Evaluation, Vol 30(3), pp 115-152, 1997. 

2. Y. Bard, “Some extensions to multi-class queueing network analysis,” In A. Butrimenko,
M. Arato, E. Gelenbe, Editors, Proc. of 4th Int. Symposium on Modeling
and Performance Evaluation of Computer Systems, pp 51-62, Amsterdam, Netherlands,
North-Holland Publishing Company, 1979.

3. F. Baskett, K.M. Chandy, R.R. Muntz, R. Palacios, “Open, closed and mixed
networks of queues with different classes of customers,” Journal of the ACM, Vol
22(2), pp 248-260, 1975.

4. J.P. Buzen, “Computational algorithms for closed queueing networks with exponential
servers,” Communications of the ACM, Vol 16(9), pp 527-531, 1973.

5. K.M. Chandy, D. Neuse, “Linearizer: a heuristic algorithm for queueing network
models of computing systems,” Communications of the ACM, Vol 25(2), pp 126-134,
1982.

6. D.B. Chapman, E.D. Zwicky, “Building Internet firewalls,” O’Reilly & Associates,
Inc., 1995.

7. W.M. Chow, “Approximations for large scale closed queueing networks,” Perfor- 
mance Evaluation, Vol 3(1), pp 1-12, 1983. 

8. P. Cremonesi, P.J. Schweitzer, G. Serazzi, “A unifying framework for the approxi- 
mate solution of closed multi-class queueing networks,” submitted for publication, 
1999. 




Xaba: Solvers for Multi-class Closed Queueing Networks 






9. P.J. Denning, J.P. Buzen, “The operational analysis of queueing network models,” 
Computing Surveys, Vol 10(3), pp 225-261, September 1978. 

10. L.W. Dowdy, B.M. Carlson, A.T. Krantz, S.K. Tripathi, “Single class bounds of 
multi-class queueing networks,” Journal of the ACM, Vol 39(1), pp 188-213, Jan- 
uary 1992. 

11. D.L. Eager, K.C. Sevcik, “Performance bound hierarchies for queueing networks,” 
ACM Transactions on Computer Systems, Vol 1(2), pp 99-115, 1983. 

12. D.L. Eager, K.C. Sevcik, “Bound hierarchies for multiple-class queueing networks,” 
Journal of the ACM, Vol 33(4), pp 179-206, 1986. 

13. T. Kerola, “The composite bound method for computing throughput bounds in
multiple class environments,” Performance Evaluation, Vol 6(1), pp 1-9, 1986.

14. C. Knessl, C. Tier, “Asymptotic approximations and bottleneck analysis in prod- 
uct form queueing network with large populations,” Performance Evaluation, Vol 
33(4), pp 219-248, 1998. 

15. M. Reiser, H. Kobayashi, “Queueing networks with multiple closed chains: theory 
and computational algorithms,” IBM Journal of Research and Development, Vol 
19, pp 283-294, 1975. 

16. M. Reiser, S. Lavenberg, “Mean-value analysis of closed multi-chain queueing net- 
works,” Journal of the ACM, Vol 27(2), pp 313-322, 1980. 

17. P.J. Schweitzer, “Approximate analysis of multi-class closed queueing networks of 
queues,” Proc. of Int. Conference on Stochastic Control and Optimization, Ams- 
terdam, April 1979. 

18. P.J. Schweitzer, “A survey of mean value analysis, its generalizations, and appli- 
cations, for networks of queues,” Proc. of the Second International Workshop of 
the Netherlands National Network for the Mathematics on Operations Research, 
Amsterdam, February 1991. 

19. P.J. Schweitzer, G. Serazzi, M. Broglia, “A survey of bottleneck analysis in closed 
networks of queues,” in L. Donatiello R. Nelson Eds., Proc. of Performance evalua- 
tion of computer and communication systems. Joint tutorial papers of Performance 
’93 and Sigmetrics ’93, LNCS 729, pp 491-508, Springer Verlag, 1993. 

20. P.J. Schweitzer, G. Serazzi, M. Broglia, “A queue-shift approximation technique
for product form queueing networks,” In R. Puigjaner, N. N. Savino, B. Serra
Eds., Proc. of 10th Int. Conference on Modeling Techniques and Tools, pp 267-279,
LNCS 1469, Springer Verlag, 1998.

21. H. Wang, K.C. Sevcik, “Experiences with improved approximate Mean Value Analysis
algorithms,” In R. Puigjaner, N. N. Savino, B. Serra Eds., Proc. of 10th Int.
Conference on Modeling Techniques and Tools, pp 280-291, LNCS 1469, Springer
Verlag, 1998.

22. Xaba, available at http://www.elet.polimi.it/Users/DEI/Sections/Compeng/- 
Paolo.Cremonesi/Projects/Xaba 

23. J. Zahorjan, D.L. Eager, H.M. Sweillam, “Accuracy, speed and convergence of 
approximate mean value analysis,” Performance Evaluation, Vol 8(4), pp 255-270, 
1988. 




Decomposition of General Tandem Queueing 
Networks with MMPP Input 



Armin Heindl 

Technische Universität Berlin,
Prozeßdatenverarbeitung und Robotik,
Franklinstr. 28/29, 10587 Berlin, Germany,
heindl@cs.tu-berlin.de



Abstract. For tandem queueing networks with generally distributed 
service times, decomposition often is the only feasible solution method 
besides simulation. The network is partitioned into individual nodes 
which are analysed in isolation. In existing decomposition algorithms 
for continuous-time networks, the output of a queue is usually approx- 
imated as a renewal process, which serves as the arrival process to the 
next queue. In this paper, the internal traffic processes are described as 
semi-Markov processes (SMPs) and Markov modulated Poisson processes 
(MMPPs). Thus, correlations in the traffic streams, which are known to 
have a considerable impact on performance, are taken into account to 
some extent. A two-state MMPP, which arises frequently in communica- 
tions modeling, serves as input to the first queue of the tandem network. 
For tandem networks with infinite or finite buffers, stationary mean 
queue lengths at arbitrary time computed quasi-promptly by the decom- 
position component of the tool TimeNET are compared to simulation. 



1 Introduction 

Tandem queueing networks arise in a wide range of applications, where cus- 
tomers, jobs, packets, etc. are serviced by a series of queueing systems. Often, 
general service time distributions as well as finite buffers are required for different 
nodes. In addition, the arrival process to the first queue should be able to capture 
correlations and burstiness, since real traffic often exhibits these characteristics. 

In this paper, the input to the tandem queueing network is assumed to be a 
Markov modulated Poisson process with two states (MMPP(2)). Many different 
procedures for matching such an MMPP to observed traffic have been published 
(e.g., IM). The nodes of the tandem queueing network are represented as 
single-server FIFO systems with or without a finite buffer, i.e., in the Kendall
notation $\cdot/G/1/K$ or $\cdot/G/1$. Arrivals to a full buffer are lost.

Tandem queueing networks of this general type do not lend themselves to an 
exact analysis - primarily due to concurrent non-exponential activities. Apart 
from simulation, an approximate analysis technique known as decomposition can 
be applied in principle. The nodes of the network are analysed in isolation. The 
output traffic of a single queueing system is characterized and - in the case of tandem



B.R. Haverkort et al. (Eds.): TOOLS 2000, LNCS 1786, pp. 86-100, 2000.
(c) Springer-Verlag Berlin Heidelberg 2000






networks - serves as the arrival process to the subsequent queue. Generally, 
decomposition delivers various (stationary) performance measures, like mean 
waiting times, mean queue lengths, loss probabilities, etc., very quickly. 

Existing decomposition algorithms for queueing networks of continuous-time
queues [12,14,19,16] are usually based on the assumption that all traffic processes
- input as well as traffic between queues - can be well approximated
by renewal processes. As a consequence, they cannot handle an MMPP input.
Furthermore, studies of a discrete-time dual tandem queue in [8] reveal how de- 
composition results can be improved by an internal traffic characterization which 
takes into account correlations of the interdeparture intervals (i.e., discrete-time 
semi-Markov processes (SMPs) in this case) as compared to renewal processes. 

In the case of continuous time, internal traffic can be described by continuous- 
time SMPs and MMPPs with two states as proposed in this paper. This naturally 
allows MMPP(2) sources as input to the first queue. For example in communi- 
cations modeling, MMPPs serve to model overflow from a finite trunk group or 
the superposition of packetized voice processes and packet data |S]. Depending 
on the application, MMPP input as well as continuous service time distributions 
at the nodes may rather suit the modeler’s needs. 

The decomposition algorithm proposed in this paper is implemented and cur- 
rently integrated into the software tool TimeNET m- Depending on the specific 
tandem network, the algorithm may be forced to resort to existing decomposi- 
tion techniques based on renewal processes (instead of SMP(2)s and MMPP(2)s), 
before the last node has been reached. Thus, the new approach may rather be 
viewed as an extension of the framework of decomposition than a stand-alone 
solution method. 

In Sect. 2, we briefly review the traffic processes and their relevant characteristics
employed in the proposed procedures. Section 3 outlines the approximation
of the output processes of single-server queues with or without finite capacity as
SMPs, while Sect. 4 treats the conversion of this traffic descriptor into an arrival
process of better analytical tractability, i.e., preferably into an MMPP. Tandem
configurations of queues of different types are analysed in Sect. 5, followed by
concluding remarks in Sect. 6.

2 Traffic Descriptors 

Previous decomposition approaches commonly assumed that all traffic processes 
can be well approximated by simple renewal processes. These recurrent point pro- 
cesses are defined by independent and identically distributed interarrival times 
of customers. Moreover, decomposition algorithms often only deal with the first 
two moments of the interarrival time for reasons of efficiency fTUITTT| . 

Let $D$ be the random variable of the renewal period between the occurrences
of customers, which may be interpreted as arrivals or departures. Then, in the
context of decomposition, the renewal process is characterized by the rate
$\lambda_D = 1/E[D]$ and the squared coefficient of variation (SCV)
$c_D^2 = \mathrm{Var}(D)/E[D]^2$, where $E[D]$
and $\mathrm{Var}(D)$ denote the expectation and variance of $D$, respectively.

Apart from the fact that this two-parameter approach cannot distinguish
between distributions with the same mean and variance (e.g., $c_D^2 = 1/3$ could
stem from an Erlang-3 or a uniform distribution), the knowledge of the traffic
processes within queueing networks and their correlation structure raises the
question of the validity of the renewal assumption. It is well known that correlations
between interarrival times have a significant impact on the performance
indices of queues. MMPPs and SMPs make it possible to capture these correlations while
retaining analytical tractability.
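
The ambiguity of the two-parameter description can be checked with a few lines of exact arithmetic. The sketch below, whose helper names are chosen for this example only, computes the SCV of an Erlang-3 distribution and of a uniform distribution on an interval starting at zero from their standard mean and variance formulas; both come out as 1/3.

```python
from fractions import Fraction


def scv(mean, variance):
    """Squared coefficient of variation: Var(D) / E[D]^2."""
    return variance / mean ** 2


def erlang_moments(k, mu):
    """Mean and variance of an Erlang-k distribution with phase rate mu."""
    return Fraction(k) / mu, Fraction(k) / mu ** 2


def uniform_moments(a, b):
    """Mean and variance of a uniform distribution on [a, b] (integers)."""
    return Fraction(a + b, 2), Fraction((b - a) ** 2, 12)


erlang_scv = scv(*erlang_moments(3, 3))     # Erlang-3 with mean 1
uniform_scv = scv(*uniform_moments(0, 2))   # uniform on [0, 2], mean 1
```

Both distributions above have mean 1 and SCV 1/3, so a renewal description based on the first two moments cannot tell them apart.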

In the following, we will define MMPPs and SMPs. In both cases, a restriction 
to two states keeps the proposed procedures efficient. 



2.1 Two-State MMPPs 

A two-state MMPP [3] is a point process whose arrival rate varies randomly 
over time between two values. More precisely, the arrival rate is governed by an 
ergodic continuous time Markov chain (CTMC), in our case with the state space 
{0, 1} and the generator matrix Q: 



$$Q = \begin{pmatrix} -r_0 & r_0 \\ r_1 & -r_1 \end{pmatrix}$$



Let $\pi = (r_1, r_0)/(r_0 + r_1)$ be the steady state vector of this Markov chain such that
$\pi Q = 0$ and $\pi e = 1$, where $e = (1,1)^T$. In state $i$, customers arrive with a
constant rate $\lambda_i$ ($i = 0, 1$). Thus four rate parameters together with an initial
distribution $\nu_0$ completely define the two-state MMPP. In the BMAP notation,
an MMPP(2) is given in terms of the two matrices



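$$D_0 = \begin{pmatrix} -(r_0 + \lambda_0) & r_0 \\ r_1 & -(r_1 + \lambda_1) \end{pmatrix}
\quad \text{and} \quad
D_1 = \begin{pmatrix} \lambda_0 & 0 \\ 0 & \lambda_1 \end{pmatrix},$$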



by which many relevant characteristics can be conveniently expressed. 
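
As a small illustration of this notation, the following Python sketch builds $Q$, $D_0$, $D_1$ and the normalized steady state vector for given rates and evaluates the mean arrival rate $\pi D_1 e$. The function names and the use of nested lists for the $2 \times 2$ matrices are choices made for this example.

```python
def mmpp2(r0, r1, lam0, lam1):
    """Return (Q, D0, D1, pi) of a two-state MMPP."""
    Q = [[-r0, r0], [r1, -r1]]
    D0 = [[-(r0 + lam0), r0], [r1, -(r1 + lam1)]]
    D1 = [[lam0, 0.0], [0.0, lam1]]
    s = r0 + r1
    pi = [r1 / s, r0 / s]  # solves pi Q = 0 with pi e = 1
    return Q, D0, D1, pi


def mean_rate(pi, D1):
    """Mean arrival rate pi * D1 * e."""
    return sum(pi[i] * D1[i][j] for i in range(2) for j in range(2))
```

For $r_0 = 1$, $r_1 = 3$, $\lambda_0 = 2$ and $\lambda_1 = 10$, the chain spends three quarters of the time in state 0, so the mean arrival rate is $0.75 \cdot 2 + 0.25 \cdot 10 = 4$.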

In Sects. 3 and 4, we will consider the moments of the counting function $N(t)$,
the number of arrivals in $(0,t]$ of a time-stationary MMPP(2). From [4,9], we
obtain the mean, variance and third centralized moment of $N(t)$ as



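$$E[N(t)] = \pi D_1 e \cdot t$$

$$\mathrm{Var}(N(t)) = E[N(t)] + 2t \left( (\pi D_1 e)^2 - \pi D_1 (D_0 + D_1 + e\pi)^{-1} D_1 e \right)
+ 2 \pi D_1 \left( e^{(D_0 + D_1)t} - I \right) (D_0 + D_1 + e\pi)^{-2} D_1 e$$

$$E[(N(t) - E[N(t)])^3] = g(t) - 3 \mathrm{Var}(N(t)) (E[N(t)] - 1)
- E[N(t)] (E[N(t)] - 1)(E[N(t)] - 2)$$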

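where $g(t)$ denotes the third factorial moment $E[N(t)(N(t) - 1)(N(t) - 2)]$ of the
counting function, which is available in closed form with coefficients $A_{ij}$.
Expressions for $A_{ij}$ in terms of the parameters $\lambda_0, \lambda_1, r_0, r_1$ can be found
in the literature cited above.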

As for renewal processes, the SCV can be used to characterize traffic burstiness,
i.e., the variability of the arrival stream. But due to the correlations in the
MMPP, the index of dispersion for counts (IDC) $I(t)$, defined as

$$I(t) = \frac{\mathrm{Var}(N(t))}{E[N(t)]} \qquad (1)$$

proves to be a more satisfactory burstiness descriptor. The limiting index of
dispersion $I = \lim_{t \to \infty} I(t)$ is of particular interest. For MMPP(2)s, it holds:



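$$I = \lim_{t \to \infty} \frac{\mathrm{Var}(N(t))}{E[N(t)]}
= 1 + 2 \left( \pi D_1 e - \frac{\pi D_1 (D_0 + D_1 + e\pi)^{-1} D_1 e}{\pi D_1 e} \right)
= 1 + \frac{2 (\lambda_0 - \lambda_1)^2 r_0 r_1}{(r_0 + r_1)^2 (\lambda_0 r_1 + \lambda_1 r_0)}$$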

The interrupted Poisson process (IPP) constitutes a well known special case of
an MMPP with two states. For an IPP, one arrival rate - say $\lambda_1$ - equals zero.
MMPPs generally fall into the class of SMPs [1].
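
The two expressions for the limiting index of dispersion can be cross-checked numerically. The following Python sketch evaluates the closed form in the rate parameters and, independently, the matrix form with an explicit $2 \times 2$ inverse of $D_0 + D_1 + e\pi$. The function names are hypothetical, and the code is only a consistency check under the definitions of this section, not part of the decomposition algorithm.

```python
def limiting_idc(r0, r1, lam0, lam1):
    """Closed-form limiting IDC of an MMPP(2)."""
    s = r0 + r1
    return 1.0 + (2.0 * (lam0 - lam1) ** 2 * r0 * r1
                  / (s ** 2 * (lam0 * r1 + lam1 * r0)))


def limiting_idc_matrix(r0, r1, lam0, lam1):
    """Same quantity from the matrix expression."""
    s = r0 + r1
    pi = [r1 / s, r0 / s]
    # M = D0 + D1 + e*pi, where D0 + D1 equals the generator Q.
    a, b = -r0 + pi[0], r0 + pi[1]
    c, d = r1 + pi[0], -r1 + pi[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    lam = [lam0, lam1]
    rate = pi[0] * lam0 + pi[1] * lam1  # pi D1 e
    quad = sum(pi[i] * lam[i] * inv[i][j] * lam[j]
               for i in range(2) for j in range(2))
    return 1.0 + 2.0 * (rate - quad / rate)
```

For $r_0 = 1$, $r_1 = 3$, $\lambda_0 = 2$, $\lambda_1 = 10$ both functions give $I = 2.5$, and for $\lambda_0 = \lambda_1$ the process degenerates to a Poisson process with $I = 1$.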



2.2 Two-State SMPs 

In contrast to MMPPs, every state transition of an SMP corresponds to a cus- 
tomer arrival. The successive states form a discrete time Markov chain (DTMC), 
while the time required for each move from one state to the next is a random 
variable whose distribution function may depend on both the source and the 
destination state. Let $Z_n$ be the state of the stochastic process right after the
$n$th transition, while $Y_n$ denotes the jump time between the $(n-1)$th transition
and the $n$th transition. Then, an initial distribution $\nu_0$ and the so-called
semi-Markov kernel $Q(t)$ completely define the stochastic process. For two-state
SMPs, this kernel is a $2 \times 2$ matrix:

$$Q(t) = \left( P\{Y_n \le t, Z_n = j \mid Z_{n-1} = i\} \right)_{ij}, \quad i, j = 0, 1.$$

Alternatively, the semi-Markov kernel may be expressed by means of the transition
probability matrix $R$ of the embedded DTMC and the jump time distribution
functions $F_{ij}(t) = P\{Y_n \le t \mid Z_n = j, Z_{n-1} = i\}$ conditioned on both the
source and the target state. The following relations hold:

$$R = (r_{ij}) = \lim_{t \to \infty} Q(t), \qquad Re = e.$$

We require that $Q(0)_{ij} \neq 0$ and that mean and variance of the conditional
jump times exist. Figure 1 illustrates how an SMP could be simulated: In state
$i$ ($i = 0,1$), select the next state $j$ ($j = 0,1$) according to the distribution
contained in the $i$th row of $R$. Draw a jump time from the conditional jump
time distribution $F_{ij}(t)$. Set $i := j$ and repeat the two steps above.

[Figure 1: states 0 and 1; each transition from $i$ to $j$ is labelled $r_{ij} : F_{ij}(t)$ for $i,j = 0,1$.]


Fig. 1. A semi-Markov process with two states 
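
The simulation loop described above (select the next state from row $i$ of $R$, draw a jump time from $F_{ij}$, set $i := j$) can be sketched as follows. This is only an illustration: the function name, the example matrix and the exponential jump times are assumptions made for the sketch and do not come from the paper.

```python
import random


def simulate_smp(R, draw_jump, n_jumps, start=0, seed=0):
    """Simulate a two-state SMP and return a list of (arrival time, new state).

    R         -- 2x2 transition probability matrix of the embedded DTMC
    draw_jump -- callable (i, j, rng) returning a sample of the conditional
                 jump time for a move from state i to state j
    """
    rng = random.Random(seed)
    state, clock, path = start, 0.0, []
    for _ in range(n_jumps):
        # Step 1: choose the next state j from row `state` of R.
        nxt = 0 if rng.random() < R[state][0] else 1
        # Step 2: draw the jump time from F_ij and advance the clock.
        clock += draw_jump(state, nxt, rng)
        path.append((clock, nxt))
        # Step 3: i := j
        state = nxt
    return path


if __name__ == "__main__":
    # Hypothetical example: exponential jump times whose rate depends on (i, j).
    R = [[0.3, 0.7], [0.6, 0.4]]
    arrivals = simulate_smp(R, lambda i, j, rng: rng.expovariate(1.0 + i + j), 5)
    for t, s in arrivals:
        print(round(t, 3), s)
```

Every entry of the returned path corresponds to one customer arrival, since each transition of the SMP is an arrival.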



Using the BMAP notation, the semi-Markov kernel of an MMPP, viewed
as an SMP, is simply obtained by

\[ Q(t) = \int_0^t e^{D_0 u} D_1 \, du = \left(e^{D_0 t} - I\right) D_0^{-1} D_1 . \]

In analogy to MMPPs, we want to find expressions for the counting function
$N(t)$ of an SMP with initial distribution $v_0$ and semi-Markov kernel $Q(t)$. We
assume that at time $t = 0$ a customer has arrived, i.e., a transition of the
SMP has occurred. Let $Q^*(s)$ be the Laplace-Stieltjes transform (LST) of $Q(t)$:
$Q^*(s) = \int_0^\infty e^{-st}\, dQ(t)$. If we define the probability function
$\eta_i(t) = P\{N(t) \ge i\}$, one easily sees that its LST is $\eta_i^*(s) = v_0 Q^*(s)^i e$.
Furthermore, let $V_k(t) = E[N(t)^k]$. After some algebraic manipulations, one obtains
the following terms in the Laplace-Stieltjes domain for the first three moments of
the counting function:

\[ V_1^*(s) = \sum_{i=1}^{\infty} \eta_i^*(s) = v_0 Q^*(s)\,(I - Q^*(s))^{-1} e \tag{2} \]

\[ V_2^*(s) = v_0 Q^*(s)\,(I + Q^*(s))\,(I - Q^*(s))^{-2} e \tag{3} \]

\[ V_3^*(s) = v_0 Q^*(s)\,(I + 4Q^*(s) + Q^*(s)^2)\,(I - Q^*(s))^{-3} e \tag{4} \]



An inverse Laplace transformation El yields the three moments of the counting
function for a specific time $t_0$. With the computed values, one could determine
the IDC (as defined in (1)) for the considered SMP. However, the burstiness
of SMPs is more often described by the index of dispersion for intervals (IDI),
because its computation does not involve LSTs. The IDI is defined by

\[ J(n) = \frac{\mathrm{Var}\left(\sum_{j=1}^{n} D_j\right)}{n\,(E[D_n])^2} , \]

where $\{D_n\}_{n=0}^{\infty}$ is a stationary sequence of interarrival times of
a (simple) traffic process. Let $c_D^2 = \mathrm{Var}(D_n)/(E[D_n])^2$ be the SCV of an
interarrival time and $\rho_D(j) = \mathrm{Cov}(D_n, D_{n+j})/\mathrm{Var}(D_n)$ ($j = 0,1,\ldots$)
the autocorrelation function of $\{D_n\}$. A simple computation provides the following
relationship:

\[ J(n) = c_D^2 \left(1 + 2 \sum_{j=1}^{n-1} \left(1 - \frac{j}{n}\right) \rho_D(j)\right) . \]

It can be shown [2 that for the limiting indices of dispersion it holds:

\[ J = \lim_{n\to\infty} J(n) = I = \lim_{t\to\infty} I(t) = c_D^2 \left(1 + 2 \sum_{j=1}^{\infty} \rho_D(j)\right) . \]

One sees that both indices of dispersion take into account the autocorrelation
function. This results in a better description of burstiness than is accomplished
by the SCV, which relies on an isolated interarrival period (first-order statistic).
Note that for renewal processes $J = I = c_D^2$. In order to compute the indices of
dispersion for an SMP, it suffices to know the first two moments $E_{ij}$ and $E_{ij}^{(2)}$
of the conditional jump times (instead of the distribution functions $F_{ij}(t)$). In
terms of the matrices $K = (r_{ij} E_{ij})_{ij}$ and $K^{(2)} = (r_{ij} E_{ij}^{(2)})_{ij}$, one obtains the
following expressions for a stationary sequence generated by an SMP:

\[ E[D_n] = w K e \tag{5} \]

\[ E[D_n^2] = w K^{(2)} e \tag{6} \]

\[ E[D_n D_{n+j}] = w K R^{j-1} K e \tag{7} \]

where $w = (w_0, w_1)$ is the stationary distribution of the embedded
DTMC such that $wR = w$. By means of the above formulae, the values of
$c_D^2$, $\rho_D(j)$, $J(n)$, and $J$ can be calculated.
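
The formulae (5) to (7) can be turned into a short computation. The sketch below is an illustration under stated assumptions, not the authors' implementation: it treats an MMPP(2) as an SMP via $R = (-D_0^{-1})D_1$, $K = D_0^{-2}D_1$ and $K^{(2)} = -2D_0^{-3}D_1$, and sums the lag covariances in closed form with the fundamental matrix $(I - R + ew)^{-1}$, a standard Markov chain identity that is not stated in the paper. All function names are hypothetical, and the embedded chain is assumed to have $r_{01} + r_{10} > 0$.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_inv(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def vec_mat(v, a):
    return [v[0] * a[0][j] + v[1] * a[1][j] for j in range(2)]


def mmpp2_as_smp(r0, r1, lam0, lam1):
    """Return (R, K, K2) of an MMPP(2) viewed as an SMP."""
    D0 = [[-(r0 + lam0), r0], [r1, -(r1 + lam1)]]
    D1 = [[lam0, 0.0], [0.0, lam1]]
    inv = mat_inv(D0)
    inv2 = mat_mul(inv, inv)
    inv3 = mat_mul(inv2, inv)
    R = [[-x for x in row] for row in mat_mul(inv, D1)]
    K = mat_mul(inv2, D1)
    K2 = [[-2.0 * x for x in row] for row in mat_mul(inv3, D1)]
    return R, K, K2


def interval_statistics(R, K, K2):
    """Return (mean, SCV, limiting IDI J) of the interarrival times."""
    s = R[0][1] + R[1][0]
    w = [R[1][0] / s, R[0][1] / s]          # w R = w for a two-state chain
    mean = sum(vec_mat(w, K))               # (5)
    second = sum(vec_mat(w, K2))            # (6)
    var = second - mean ** 2
    # Sum over all lags j >= 1 of Cov(D_n, D_{n+j}), using (7):
    # w K [ (I - R + e w)^-1 - e w ] K e
    fund = mat_inv([[(1.0 if i == j else 0.0) - R[i][j] + w[j]
                     for j in range(2)] for i in range(2)])
    dev = [[fund[i][j] - w[j] for j in range(2)] for i in range(2)]
    wK = vec_mat(w, K)
    Ke = [sum(K[0]), sum(K[1])]
    cov_sum = sum(wK[i] * dev[i][j] * Ke[j]
                  for i in range(2) for j in range(2))
    return mean, var / mean ** 2, (var + 2.0 * cov_sum) / mean ** 2


if __name__ == "__main__":
    # IPP of Sect. 5: r0 = 0.9, r1 = 0.1, lambda0 = 5.0, lambda1 = 0.0
    print(interval_statistics(*mmpp2_as_smp(0.9, 0.1, 5.0, 0.0)))
```

For the IPP the interarrival times are uncorrelated, so the limiting IDI coincides with the SCV.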

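In the next section, we will also make use of the following representations of
$R$, $K$ and $K^{(2)}$, if we regard an MMPP as an SMP:

\[ R = \lim_{t\to\infty} Q(t) = \lim_{t\to\infty} \left(e^{D_0 t} - I\right) D_0^{-1} D_1 = (-D_0^{-1}) D_1 \tag{8} \]

\[ K = \left(r_{ij} \int_0^\infty t \, dF_{ij}(t)\right)_{ij} = \left(\int_0^\infty t \, dQ_{ij}(t)\right)_{ij} = \int_0^\infty t\, e^{D_0 t} D_1 \, dt = D_0^{-2} D_1 \tag{9} \]

\[ K^{(2)} = \left(r_{ij} \int_0^\infty t^2 \, dF_{ij}(t)\right)_{ij} = \int_0^\infty t^2 e^{D_0 t} D_1 \, dt = -2 D_0^{-3} D_1 . \tag{10} \]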



3 The MMPP(2)/G/1(/K) Queue and Its Output 

The input to the tandem network is an MMPP(2). Thus, the first queue can be
treated as an MMPP(2)/G/1(/K) system using existing exact or approximate
solution methods to obtain the desired performance indices of this queue like
mean queue length, mean (virtual) waiting time, throughput, loss probability,
etc. The proposed decomposition approach applies these MMPP(2)/G/1(/K)
procedures to as many subsequent nodes in the tandem network as possible, i.e.,
as long as the intermediate traffic can be well approximated by an MMPP. If
this assumption is violated, the decomposition algorithm falls back on existing
procedures based on renewal processes and G/G/1(/K) queues |19I1()J .

Since the analysis of MMPP(2)/G/1(/K) systems with respect to a wide
range of performance measures has been covered in many excellent publications
mm , we will concentrate on the approximation of the output process
of an MMPP(2)/G/1(/K) queue. We distinguish between the two cases
$K = 1$ and $K > 1$ (including $K = \infty$), where the buffer of size $K$ includes
the server place. In both cases, however, we approximate the output process of
the MMPP(2)/G/1(/K) system by SMPs with two states, which may then be
converted to either an MMPP(2) or a renewal process. These form the input
traffic process to the subsequent queue.

3.1 The Output of the MMPP(2)/G/1/1 Queue 

As stated above, an MMPP(2) is an SMP(2). In ca, the output of the pure loss
system SMP(2)/G/1/1 is proved to be an SMP(2) again. Instead of using the
exact formulae, we simplify in favor of efficiency, since the output SMP(2) need
only be described by the first two moments of its conditional jump times.

Obviously, for a pure loss system, each interdeparture interval is composed of
an idle period and a service, since any served customer must have arrived when
the system was empty. Let the states of the output SMP(2) be 0 and 1 with
the meaning that a move to the state $i$ ($i = 0,1$) corresponds to a departing
customer who entered the (empty) system when the arrival process, i.e., the
MMPP(2), was in phase $i$. Figure 2 depicts the output SMP(2).



Fig. 2. Output approximation of the MMPP(2)/G/1/1 queue 

The symbols $I_{ij}$ ($i,j = 0,1$) denote the random variables of the idle periods included
in the corresponding interdeparture time, while random variable $S$ stands
for the service time, whose distribution function is given by $F_S(t)$. The distribution
of the conditional jump times of the output SMP(2) is the distribution
of the sum of the random variables $I_{ij}$ and $S$. $R_D$ is the transition probability
matrix of the SMP(2) with components $r^D_{ij}$.

To determine $R_D$, we introduce the matrix $A = (a_{ij})$ ($i,j = 0,1$) defined in
the context of MMPP(2)/G/1(/K) analysis as:

$a_{ij}$ = P(service time ends with MMPP in phase $j$ | service began in phase $i$).

From m. we have

\[
A = \int_0^\infty e^{(D_0 + D_1)t}\, dF_S(t)
  = e\pi - \frac{1}{r_0 + r_1} \int_0^\infty e^{-(r_0 + r_1)t}\, dF_S(t) \cdot (D_0 + D_1) .
\]


At the end of a service time in the MMPP(2)/G/1/1 queue, which began in
phase $i$, the output SMP(2) enters state $i$, while the MMPP(2) is in phase $k$
with probability $a_{ik}$ ($k = 0,1$). The next customer will arrive at the queue after
the idle period $I_{ij}$ with the arrival process in state $j$ ($j = 0,1$) with probability
$((-D_0^{-1})D_1)_{kj}$. However, this exactly means that the SMP(2) will move from $i$
to $j$ with this customer's departure. Consequently, $R_D = A(-D_0^{-1})D_1$. In other
words, at the moment the output SMP(2) enters state $i$, the input MMPP(2)
can be interpreted to behave as a two-state SMP, which is defined by the initial
distribution $(a_{i0}, a_{i1})$ and the semi-Markov kernel $Q_A(t) = (e^{D_0 t} - I)D_0^{-1}D_1$
such that $R_A = (-D_0^{-1})D_1$, $K_A = D_0^{-2}D_1$ and $K_A^{(2)} = -2D_0^{-3}D_1$ (see (8) to (10)).

Taking into account the service times in the interdeparture intervals of the
output SMP(2), we can write for this SMP(2):

\[ K_D = (r^D_{ij} E^D_{ij}) = A K_A + R_D E[S] = A D_0^{-2} D_1 + A(-D_0^{-1}) D_1 E[S] \]

\[ K_D^{(2)} = (r^D_{ij} E^{D(2)}_{ij}) = A K_A^{(2)} + 2 (A K_A) E[S] + R_D E[S^2] . \]

These formulae are derived by applying moment calculus to the matrix components,
which represent mixtures of sums of random variables. Note that $E^D_{ij}$ and
$E^{D(2)}_{ij}$ are the first two moments of the conditional jump times of the output
SMP(2) with distribution function $F_{I_{ij}+S}(t)$. They are easily obtained from the
(restricted) characterization of the output SMP by $R_D$, $K_D$, and $K_D^{(2)}$.
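
A minimal sketch of the first-moment part of this construction follows, assuming the two-state closed form of $A$ given above. The function name and the callable `lst_s` (the LST of the service time, supplied by the caller) are hypothetical names introduced for the illustration.

```python
import math


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_inv(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def loss_queue_output(r0, r1, lam0, lam1, lst_s, mean_s):
    """Return (A, R_D, K_D) for the MMPP(2)/G/1/1 output approximation."""
    D0 = [[-(r0 + lam0), r0], [r1, -(r1 + lam1)]]
    D1 = [[lam0, 0.0], [0.0, lam1]]
    pi = [r1 / (r0 + r1), r0 / (r0 + r1)]
    f = lst_s(r0 + r1) / (r0 + r1)
    # A = e*pi - f * (D0 + D1)
    A = [[pi[j] - f * (D0[i][j] + D1[i][j]) for j in range(2)]
         for i in range(2)]
    inv = mat_inv(D0)
    R_A = [[-x for x in row] for row in mat_mul(inv, D1)]   # (-D0^-1) D1
    K_A = mat_mul(mat_mul(inv, inv), D1)                    # D0^-2 D1
    R_D = mat_mul(A, R_A)
    AK = mat_mul(A, K_A)
    K_D = [[AK[i][j] + R_D[i][j] * mean_s for j in range(2)]
           for i in range(2)]
    return A, R_D, K_D


if __name__ == "__main__":
    # Hypothetical check: equal rates in both phases (Poisson arrivals with
    # rate 2.0) and a deterministic service time of 1.0.
    A, R_D, K_D = loss_queue_output(0.3, 0.2, 2.0, 2.0,
                                    lambda s: math.exp(-s), 1.0)
    print([sum(row) for row in K_D])
```

With equal arrival rates in both phases the input is a Poisson process, and each row sum of $K_D$ reduces to $1/\lambda + E[S]$, the mean of one idle period plus one service time.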



3.2 Approximation of the Output of the MMPP(2)/G/1(/K>1) 
Queue 

For delay-loss ($K > 1$, $K \neq \infty$) and pure delay ($K = \infty$) systems of the
MMPP(2)/G/1(/K) type, the output processes are of course more complex than
an SMP with two states. Nevertheless, an SMP(2) approximation based on a
busy period analysis appears quite intuitive (see also i)-. Again, let the states
of the output SMP(2) be 0 and 1. However, now a move to state 0 corresponds to
a departing customer who arrived when the system was empty, while a move to
state 1 relates to a departing customer who entered a non-empty system. In the
latter case, the preceding interdeparture time must equal a service period with
distribution function $F_S(t)$. Figure 3 shows the considered SMP(2) approximation
for the output process. It reflects the above fact by $F_{i1}(t) = F_S(t)$ ($i = 0,1$).




Fig. 3. Output approximation of the MMPP(2)/G/1(/K>1) queue

The random variable $N$ stands for the number of customers in a busy period
of the MMPP(2)/G/1(/K) queue, while $I^{(N=1)}$ and $I^{(N>1)}$ denote the random
variables of the idle periods following a busy period with a single or more
than one customer, respectively. The service period of the first customer in a
busy period is incorporated in the conditional jump time distribution functions
$F_{I^{(N=1)}+S}(t)$ and $F_{I^{(N>1)}+S}(t)$. Generally, the preceding idle period depends on
the state of the arrival process at its beginning and the exact number of customers
in the busy period that just ended. By probabilistic weighing, we will
be able to abstract to $I^{(N=1)}$ and $I^{(N>1)}$. Summarizing, a transition from state
0 to state 0 of the output SMP(2) indicates that the previous interdeparture
interval comprises a single-customer busy period, whereas a path of the form
$0 \to 1 \to \cdots \to 1 \to 0$ ($i$ occurrences of 1s after 0) with a concluding zero
contains a busy period with $i + 1$ customers.

We will now determine the transition probabilities and the first two moments
of the conditional jump times. First, we observe that the SMP(2) model
of the output process approximates the distribution of $N$ as a geometric distribution,
which leads to the following expressions:

\[ P(N = 1) = r^D_{00} = 1 - r^D_{01} \]

\[ E[N] = 1 + \frac{r^D_{01}}{r^D_{10}} . \]

On the other hand, let us assume for now that the employed procedures for
the analysis of MMPP(2)/G/1(/K) queues yield the mean length of the busy
period $E[L]$ and the vector $x_0^{(N=1)}$, whose $j$th component is the stationary
probability of ending a busy period with a single customer with the arrival
process being in phase $j$ ($j = 0,1$). Then

\[ 1 - r^D_{01} \;(= P(N = 1)) = x_0^{(N=1)} e , \qquad 1 + \frac{r^D_{01}}{r^D_{10}} \;(= E[N]) = \frac{E[L]}{E[S]} . \]

These equations are solved for $r^D_{01}$ and $r^D_{10}$. With $\sum_j r^D_{ij} = 1$ ($i = 0,1$), the
transition probability matrix $R_D$ is completely determined.
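
Solving the two geometric-distribution relations for the off-diagonal transition probabilities is a two-line computation. The sketch below takes $P(N = 1)$ and $E[N]$ as plain inputs; the function name and the numerical values in the usage example are hypothetical.

```python
def output_transition_matrix(p_single, mean_customers):
    """Build the 2x2 matrix R_D from P(N = 1) and E[N].

    Uses P(N = 1) = 1 - r01 and E[N] = 1 + r01 / r10.
    """
    if not 0.0 <= p_single < 1.0:
        raise ValueError("P(N = 1) must lie in [0, 1)")
    if mean_customers <= 1.0:
        raise ValueError("E[N] must exceed 1 when P(N = 1) < 1")
    r01 = 1.0 - p_single
    r10 = r01 / (mean_customers - 1.0)
    if r10 > 1.0:
        raise ValueError("inputs are inconsistent with a geometric N")
    return [[1.0 - r01, r01], [r10, 1.0 - r10]]


if __name__ == "__main__":
    # Hypothetical inputs: half of all busy periods serve one customer,
    # and a busy period serves three customers on average.
    print(output_transition_matrix(0.5, 3.0))
```

The consistency check on $r^D_{10}$ matters in practice, because $P(N = 1)$ and $E[N]$ come from an exact queue analysis while the geometric form of $N$ is only an approximation.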

Since two conditional jump time distribution functions coincide with $F_S(t)$, it
remains to determine the first two moments of each sum $I^{(N=1)} + S$ and $I^{(N>1)} + S$.
This in turn is achieved, if we have found the first two moments of $I^{(N=1)}$
and $I^{(N>1)}$. At the end of a busy period with a single customer, the MMPP(2),
i.e., the arrival traffic process, is distributed over its two states according to
$x_0^{(N=1)} / (x_0^{(N=1)} e)$. The length of the subsequent idle period equals the time
to the next arrival. In fact, we are interested in the time to absorption of a
non-ergodic CTMC with (defective) generator $D_0$ on the transient states and an
initial distribution of $x_0^{(N=1)} / (x_0^{(N=1)} e)$. Its density function is given [IS] by

\[ f_{I^{(N=1)}}(t) = \frac{1}{x_0^{(N=1)} e}\, x_0^{(N=1)} \left(-D_0 e^{D_0 t} e\right) , \]

leading to the following moments:

\[
E[I^{(N=1)}] = \frac{1}{x_0^{(N=1)} e}\, x_0^{(N=1)} (-D_0^{-1}) e , \qquad
E[(I^{(N=1)})^2] = \frac{1}{x_0^{(N=1)} e}\, x_0^{(N=1)}\, 2 D_0^{-2} e .
\]

For the moments of $I^{(N>1)}$, we argue in a similar way where we determine an
approximate initial distribution $x_0^{(N>1)} / (x_0^{(N>1)} e)$ from
$x_0^{(N=1)} + x_0^{(N>1)} = \frac{1}{x_0 e} x_0$:

\[
E[I^{(N>1)}] = \frac{1}{x_0^{(N>1)} e}\, x_0^{(N>1)} (-D_0^{-1}) e , \qquad
E[(I^{(N>1)})^2] = \frac{1}{x_0^{(N>1)} e}\, x_0^{(N>1)}\, 2 D_0^{-2} e .
\]

Here $x_0$ is defined as the vector of the stationary probabilities that a departure
leaves the considered queue empty with the MMPP(2) in phase $j$ ($j = 0,1$). For
either the MMPP(2)/G/1/K ($K > 1$) or the MMPP(2)/G/1 queue, $x_0$, $x_0^{(N=1)}$, and $E[L]$
have to be computed before obtaining $R_D$ and the first two moments of $I^{(N=1)}$ and
$I^{(N>1)}$ of the departure process. Finally, from these values and the first two
moments of the service time, $K_D$ and $K_D^{(2)}$ are easily constructed.

For stable MMPP(2)/G/1 systems, $x_0 = \frac{1 - \rho}{\pi D_1 e}\, g(-D_0)$, where the mean offered
load $\rho = \pi D_1 e \cdot E[S] < 1$ and the two-dimensional vector $g$ is determined
from $gG = g$, $ge = 1$ m-. The stochastic matrix $G$ is essential to the computational
procedures of the matrix-analytic approach of BMAP/G/1 queues. In our
framework, its components $(G)_{ij}$ ($i,j = 0,1$) are interpreted as:

$(G)_{ij}$ = P(busy period starting with MMPP in phase $i$ ends in phase $j$)

They can be efficiently computed by a specialized algorithm given in jDfJ . For
queues with a finite buffer, $x_0$ may be found from an embedded DTMC analysis
of the system at the departure instants. A system of linear equations of size $2K$
has to be solved [T]. If the MMPP(2)/G/1/K queue is stable and equipped with
a large buffer, the approximate solution method of |S] might also be applied to
yield $x_0$. In all cases, the mean busy period $E[L]$ is obtained via



\[
E[L] = \frac{\rho(1 - p_{loss})}{1 - \rho(1 - p_{loss})}\, E[\text{idle period}]
     = \frac{\rho(1 - p_{loss})}{1 - \rho(1 - p_{loss})} \cdot \frac{1}{x_0 e}\, x_0 (-D_0^{-1}) e ,
\]

where $p_{loss}$ denotes the loss probability [T]. Finally, it holds

\[
x_0^{(N=1)} = \frac{1}{x_0 e}\, x_0 (-D_0^{-1}) D_1 \int_0^\infty e^{D_0 t}\, dF_S(t)
            = \frac{1}{x_0 e}\, x_0 (-D_0^{-1}) D_1 \sum_{n=0}^{\infty} P^n \int_0^\infty e^{-\theta t} \frac{(\theta t)^n}{n!}\, dF_S(t) ,
\]

where $\theta = \max_i \{(-D_0)_{ii}\}$ and $P = \frac{1}{\theta} D_0 + I$. The last expression suggests the
efficient method of randomisation to compute m-. Note that $\frac{1}{x_0 e}\, x_0 (-D_0^{-1}) D_1$
is the distribution of the arrival process at the moment the first customer of a
busy period enters the system. The first integral contains the probabilities that
no other customer arrives before the first customer's service is finished.

4 Conversion of SMP(2) to MMPP(2) 

In the previous section, the output traffic of a node is approximated by an SMP 
with two states. Its transition probability matrix and the first two moments of the 
conditional jump time distribution functions are given. This traffic process could 
serve as the input to the subsequent queue in the tandem network. However, an 
efficient algorithm to obtain the quantities required for the output approximation 
of this SMP(2)/G/1(/K) queue is still lacking. 

On the other hand, the matrix-analytic approach for MMPP(2)/G/1(/K)
systems, which was already applied to the first node, possesses the desired flexibility
and efficiency. So, the proposed algorithm approximately converts the
SMP(2) to an MMPP(2). Since, for MMPPs, the SCV is always larger than or
equal to unity, a successful procedure must verify this condition for the SMP
first. Otherwise, its mean rate $\lambda_D = 1/E[D_n]$ and SCV $c_D^2$ are used
to characterize a renewal process, which is fed into the next node. This embeds
the proposed algorithm into the existing decomposition framework based
on GI/G/1(/K) nodes.

Roughly speaking, by converting an SMP(2) to an MMPP(2), we mean that
four parameters related to the counting function of both processes are matched.
First, we determine the limiting index of dispersion $I$ of the SMP(2) and its first
three moments of the counting function at a specific time $t_0$. From (5), (6), and
(7), we see that the knowledge of $R_D$, $K_D$, and $K_D^{(2)}$ allows us to calculate $J = I$, while
the computation of $E[N(t_0)]$, $E[N(t_0)^2]$, and $E[N(t_0)^3]$ actually requires at least
the LST of the semi-Markov kernel of the SMP (see (2), (3), and (4)).

We construct $Q_D^*(s)$ from $R_D$, $K_D = (r^D_{ij} E_{ij})$, and $K_D^{(2)} = (r^D_{ij} E_{ij}^{(2)})$ by fitting a
Gamma distribution to the first two moments of the conditional jump times.

Let the conditional jump time distribution $F_{ij}(t)$ be characterized by the
first two moments $E_{ij}$ and $E_{ij}^{(2)}$. The two parameters $\alpha$ and $\lambda$ of the Gamma
distribution function $F(t) = \int_0^t \frac{\lambda^\alpha x^{\alpha - 1} e^{-\lambda x}}{\Gamma(\alpha)}\, dx$ are determined from the
equalities for the mean and SCV:

\[ \frac{\alpha}{\lambda} = E_{ij} , \qquad \frac{E_{ij}^{(2)}}{E_{ij}^2} - 1 = \frac{1}{\alpha} . \]

Thus, we substitute the LST of $F(t)$, $F^*(s) = \left(\frac{\lambda}{\lambda + s}\right)^{\alpha}$, for $F^*_{ij}(s)$ in $Q_D^*(s)$.
Whenever $F_{ij}(t) = F_S(t)$, we insert the exact LST of $F_S(t)$ if available. Only in
this case, $E_{ij}^{(2)}/E_{ij}^2 - 1$ might be zero, i.e., $F_{ij}(t)$ ($= F_S(t)$) describes the deterministic
distribution, to the moments of which a Gamma distribution could not be fitted.

An inverse Laplace transformation El, delivers $E[N(t_0)]$, $E[N(t_0)^2]$, and
$E[N(t_0)^3]$ for time $t_0$. Note that $t_0$ could take any value greater than zero;
experiments suggest the heuristic value $t_0$ = 10 • ^ as a useful choice.
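
The Gamma fit described above amounts to two divisions. The following sketch (function names are illustrative, not from the paper) returns the shape $\alpha$ and rate $\lambda$ from the first two moments and evaluates the resulting LST; it rejects the deterministic case, for which the text prescribes the exact LST instead.

```python
def fit_gamma(mean, second_moment):
    """Match a Gamma distribution to a mean and a second moment.

    Returns (alpha, lam) with alpha / lam = mean and 1 / alpha = SCV.
    """
    scv = second_moment / mean ** 2 - 1.0
    if scv <= 0.0:
        raise ValueError("SCV must be positive; use the exact LST instead")
    alpha = 1.0 / scv
    lam = alpha / mean
    return alpha, lam


def gamma_lst(alpha, lam, s):
    """LST of the Gamma distribution: (lam / (lam + s)) ** alpha."""
    return (lam / (lam + s)) ** alpha


if __name__ == "__main__":
    # An exponential distribution with mean 2 has second moment 8.
    alpha, lam = fit_gamma(2.0, 8.0)
    print(alpha, lam, gamma_lst(alpha, lam, 0.5))
```

An SCV of one reproduces the exponential distribution ($\alpha = 1$), which gives a quick sanity check for the fit.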

In a next step, these three quantities derived for the SMP(2) are adopted
as the moments of the counting function of a stationary MMPP(2) in $(0,t_0]$.
From (1) for the MMPP(2) and by $E[(N(t_0) - E[N(t_0)])^3] = E[N(t_0)^3] -
3E[N(t_0)]E[N(t_0)^2] + 2(E[N(t_0)])^3$, we obtain the index of dispersion for counts
$I(t_0)$ and the third centralized moment $E[(N(t_0) - E[N(t_0)])^3]$, respectively.
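
The conversion from the three raw moments to the two matching targets can be sketched as follows. The function name is hypothetical; the Poisson values in the usage example serve only as a sanity check, since a Poisson count with mean $m$ has IDC 1 and third central moment $m$.

```python
def count_descriptors(m1, m2, m3):
    """Return (IDC, third central moment) from E[N], E[N^2], E[N^3]."""
    variance = m2 - m1 ** 2
    idc = variance / m1
    third_central = m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3
    return idc, third_central


if __name__ == "__main__":
    # Poisson count with mean 2: E[N] = 2, E[N^2] = 6, E[N^3] = 22.
    print(count_descriptors(2.0, 6.0, 22.0))
```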

In a final step, a procedure developed by Heffes and Lucantoni |2] computes 
the four rates of the MMPP(2) by matching the following four characteristics: 

- the fundamental arrival rate in the interval $(0,t_0]$: $E[N(t_0)]/t_0$,

- the index of dispersion for counts $I(t_0)$,

- the limiting index of dispersion for counts $I$,

- the third centralized moment of the counting function at time $t_0$.

Thus, those features of the traffic process which heavily influence the perfor- 
mance measures of a queueing system are assumed to be preserved. 

5 Approximate Analysis of Tandem Networks 
with and without Customer Loss 

In this section, two tandem queueing networks with MMPP(2) input are de- 
composed into individual nodes. Each node is analysed in isolation by a method 
which depends on the number K of buffer places the specific queue provides. For 
MMPP(2)/G/1 systems, Lucantoni's method is applied [l3j. For queues with a
moderately sized capacity, we directly solve the Markov chain embedded at departure
epochs, while for a large capacity $K$ we prefer the approximate technique
developed for the more general BMAP/G/1/K system in [^.

In all three cases, the cited methods deliver the quantities needed to approximate
the output of the considered queue as an SMP(2) according to Sect. 3.
After having been converted to an MMPP(2) (see Sect. 4), this traffic process
becomes the arrival process to the next node. This node can also be analysed
as an MMPP(2)/G/1(/K) queue as appropriate. In this manner, we treat all
queues individually starting from the first one and proceeding to the last one.

Various performance parameters can be computed for the different queues, 
including throughputs, mean waiting times, and loss probabilities (in case K < 
oo). In this paper, we will only give results for the stationary mean queue lengths 
at arbitrary time and compare these values to those obtained by simulation. Since 
the embedded Markov chain method for queues with a capacity only yields the 
mean queue length at instants of customer departure, we approximate the mean 
queue length at arbitrary time by means of formulae given in [B] . All simulations 
were performed by means of software tool TimeNET [20] with 99% confidence 
level and a maximum relative error of 5%. 

The first example is a dual tandem queue without capacities. Customers
arrive according to an IPP and are served by two single servers in series with
a uniform distribution on the interval [0, 2] and a deterministic service time of
1.0, respectively. An IPP, as a special case of an MMPP(2), is stochastically
equivalent to a hyperexponential renewal process. Thus, we can compare the
proposed algorithm not only to simulation, but also to decomposition based on
renewal processes, namely the method described in |19| .

In the MMPP(2) setting, the parameters of the IPP are: $r_0 = 0.9$, $r_1 = 0.1$,
$\lambda_0 = 5.0$, $\lambda_1 = 0.0$. The mean arrival rate and SCV of the equivalent renewal
process are 0.5 and 10.0, respectively. Only these two values flow into the
decomposition algorithm in m-

Table 1 lists the values for the mean queue lengths at both queues obtained
by the three mentioned solution methods. Both decomposition algorithms are
also implemented in TimeNET j^. Next to the approximate analytical values
obtained by the two decomposition approaches, their relative errors with respect
to the simulated data are given. The relative errors of around 55% for the mean
queue lengths reflect the fact that decomposition based on renewal processes
is not especially designed for hyperexponential renewal input with large SCV
($c^2 > 1$).



Table 1. Mean queue lengths for the dual tandem queue with IPP input

Node   SIM                              DEC-renewal          DEC-SMP/MMPP
       value     conf. interval (-/+)   value     Re(%)      value     Re(%)
1      4.9254    0.1951                 2.2367    -54.6      4.9249    -0.01
2      1.1871    0.0557                 1.8564    +56.4      1.1045    -6.97



For the first node in the tandem, actually an MMPP(2)/G/1 (more precisely
IPP/G/1) queue, the mean queue length of the decomposition approach based
on SMP(2)s and MMPP(2)s naturally almost coincides with the simulated value.
Approximating the intermediate traffic process between the two nodes by an
SMP(2)/MMPP(2) instead of a renewal process has reduced the relative error
for the second node from +56.4% to a tolerable -6.97%.
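
The relative errors in Table 1 can be recomputed from the tabulated values with the usual definition, 100 · (approximation − simulation) / simulation. The sketch below uses only the numbers printed in the table; because those values are rounded, the recomputed percentages may differ from the printed ones in the last digit.

```python
def relative_error_percent(approx, simulated):
    """Signed relative error of an approximation in percent."""
    return 100.0 * (approx - simulated) / simulated


# (simulated, DEC-renewal, DEC-SMP/MMPP) mean queue lengths from Table 1
TABLE_1 = {
    1: (4.9254, 2.2367, 4.9249),
    2: (1.1871, 1.8564, 1.1045),
}

if __name__ == "__main__":
    for node, (sim, renewal, smp) in TABLE_1.items():
        print(node,
              round(relative_error_percent(renewal, sim), 2),
              round(relative_error_percent(smp, sim), 2))
```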

Figure 4 depicts another tandem network, which consists of eight queues
either with an infinite or a finite buffer. The exact buffer sizes and the service
time distribution functions of the single servers can be found in the figure. The
arguments for the deterministic, exponential, uniform, and Erlang distributions
stand for the fixed delay, the rate, the range, and the mean together with the
number of phases, respectively. Here, the arrival to the first queue is a true and
bursty MMPP(2) with the parameters $r_0 = 1.0$, $r_1 = 0.005$, $\lambda_0 = 100.0$, $\lambda_1 = 0.1$.
These parameters of the MMPP result in a mean arrival rate of about 0.6 and
an SCV of about 10.3.

Again, we compare the mean queue lengths for the presented decomposition
algorithm and simulation. Table 2 shows that the relative errors are below 10%,
except for the second node where the error amounts to -12.8%, which might
still be regarded as acceptable. In this example, all SMP(2) output approximations
exhibit SCVs larger than unity, so that all queues can finally be treated as
MMPP(2)/G/1(/K) systems.

[Figure 4: MMPP input (1, 0.005, 100, 0.1); service time labels exponential(2.0), erlang(0.2,2), uniform(0.0,1.0), deterministic(0.5) in the first row and deterministic(0.5), uniform(0.0,0.5), deterministic(0.5), exponential(2.0) in the second row.]

Fig. 4. An eight node tandem queueing network

Considering the quick response times, the proposed decomposition method is
attractive, especially when one has to inspect many different configurations very
quickly. Note that the simulation component of TimeNET ran for almost half
an hour on an Ultra SPARC workstation with 300 MHz to compute the results
of Table 2, with the admittedly rigid requirements mentioned above.



Table 2. Mean queue lengths for the eight node tandem queueing network

Node   SIM                               DEC-SMP/MMPP
       value      conf. interval (-/+)   value      Re(%)
1      34.8415    1.3312                 34.9112    +0.2
2      2.5924     0.1156                 2.2599     -12.8
3      0.2341     0.0105                 0.2232     -4.7
4      0.1061     0.0043                 0.1139     +7.4
5      0.3801     0.0162                 0.3813     +0.3
6      0.3888     0.0147                 0.3997     +2.8
7      0.3904     0.0195                 0.3899     -0.1
8      0.3661     0.0182                 0.3551     -3.0



6 Conclusions 

A decomposition approach for general tandem queueing networks with MMPP(2)
input has been presented. As is common for most decomposition algorithms, the
proposed method delivers approximate results for (stationary) performance measures,
like mean queue lengths, very quickly, while the MMPP(2) input increases
the range of applications.

The proposed approach extends the existing framework of decomposition.
Whenever possible, internal traffic processes are described by SMP(2)s and
MMPP(2)s instead of renewal processes. Thus, lag-k correlations of the
interdeparture/-arrival times are accounted for to some extent, which yields a better
accuracy of the performance measures. Our experiments, of which representative
examples were given in this paper, indicate that relative errors may be expected
to be around 10% (and often less), which may be tolerable in many
situations considering modeling assumptions.

A natural extension of this work suggests itself: procedures for the splitting
and superposition of traffic processes could be incorporated. Thus, general
queueing networks with Markovian routing might be approximately solved.

Abstract. This paper presents an efficient equilibrium solution algorithm
for infinite multi-dimensional Markov chains of the quasi-birth-and-death
(QBD) type. The algorithm is not based on an iterative approach,
so that the exact solution can be computed in a known, finite
number of steps. The key step on which the algorithm is based is the
identification of a linear dependence among variables. This dependence
is expressed in terms of a matrix whose size is finite. The equilibrium
solution of the Markov chain is obtained by operating on this finite matrix.
An extremely attractive feature of the newly proposed algorithm is that it
allows the computation of approximate solutions with any desired degree
of accuracy. The solution algorithm, in fact, computes a succession of
approximate solutions with growing accuracy, until the exact solution is
achieved in a finite number of steps.

Results for a case study show that the proposed algorithm is very efficient 
and quite accurate, even when providing approximate solutions. 



1 Introduction 

The search for efficient algorithms leading to the equilibrium solution of
multi-dimensional Markov chains (MC) received a strong impulse with the
matrix-geometric method proposed by Neuts in [J for the computation of the
steady-state probability distribution for ergodic MCs whose infinitesimal generator (or,
equivalently, whose probability transition matrix) has block structures with
repeating rows. Many of the studies which followed built on this approach; they
mainly focused on the solution of quasi-birth-and-death (QBD) processes, and
of matrices in the G/M/1 form [‘2|,*f ] . Fewer are the solutions proposed in the
literature for models based on the M/G/1 paradigm. In [Ij the solution is derived
by the iterative computation of an intermediate matrix. In [H] the solution is
based on a recursion derived from the permutation of rows and columns of the
matrix.

In this paper we propose an efficient algorithm for the equilibrium solution 
of multi-dimensional MCs in the QBD form. The newly proposed algorithm is 
not iterative, and its cost is polynomial in the dimension of the blocks in the 
matrix. 



B.R. Haverkort et al. (Eds.): TOOLS 2000, LNCS 1786, pp. lOl- fTTfl 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000
The great novelty, and one of the most attractive features of the proposed
approach, consists in the possibility of computing approximate solutions with
very low cost. The proposed algorithm, in fact, is based on the computation of a
finite succession of solutions, ordered according to increasing degree of accuracy.
Each element of the succession improves the accuracy of the previous element;
furthermore, it is derived from the previous element itself in an "incremental
way", thus not losing any of the computations already performed. This feature of
the proposed procedure is particularly appealing in both fields of design and
performance evaluation. Indeed, low-cost solutions, even if approximate, are usually
desirable in the preliminary phases of analysis and design, when a system needs
to be preliminarily investigated. Having obtained a feeling for the behavior of
the system, the analyst can better tailor the next investigation phases preceding
a deep and expensive analysis.

The advantage of the proposed technique over iterative procedures applied
to the same class of problems is that the exact solution can be computed in a
finite number of steps, and that approximate solutions to any desired degree of
accuracy can be obtained more efficiently. The uniqueness of these characteristics
of our approach makes it quite interesting and promising.

The paper is organized as follows. In Section 2, after providing the problem
formulation, we explain the main ideas on which the proposed approach is based,
and we derive the exact solution. The procedure for the computation of the
approximate solutions is then detailed in Section 3. As an example, in Section 4
we solve a system of two queues and we present some results about the accuracy
of the approximate solutions, and the computational cost of the procedure.
Conclusions follow in Section 5.

2 The Basic Solution 

The problem we tackle can be formalized as follows. We consider a multi-dimensional
MC, whose state space $S$ is determined by the state variable
$s = (u_1, u_2, \ldots, u_n, v)$, where each $u_i$ can assume a finite number of different values,
say $l_i$, with $\prod_{k=1}^{n} l_k = l$, and $v$ can assume all values in the set of non-negative
integers: $v \in \{0, 1, \ldots\}$. $S$ can be partitioned into subsets $S_i$, where $i$ refers to a
specific value for $v$: $S_i = \{(u_1, u_2, \ldots, u_n, v) \mid v = i\}$, with $|S_i| = l$, $\forall i$.

Let $\Pi$ be the vector of the steady-state probabilities. The problem we intend
to solve is expressed by the matrix equation:

\[ \Pi A = 0 \tag{1} \]

where $A$ is the infinitesimal generator matrix for a continuous-time MC, or
where $A = I - P$ for a discrete-time MC with transition probability matrix $P$.
According to the considered partition of the state space, the steady-state vector
$\Pi$ is also partitioned into sub-vectors, $\pi_i$, with dimension $1 \times l$, associated with
the subsets $S_i$ of the state space $S$:

\[ \Pi = [\pi_0, \pi_1, \ldots, \pi_i, \ldots] . \]

Correspondingly, matrix $A$ is composed of blocks of size $l \times l$:

\[
A = \begin{pmatrix}
B_0 & B_1 & \cdots & B_{r-1} & 0 & \cdots & & \\
A_0 & A_1 & \cdots & A_{r-1} & A_r & 0 & \cdots & \\
0 & A_0 & A_1 & \cdots & A_{r-1} & A_r & 0 & \cdots \\
\cdots & 0 & A_0 & A_1 & \cdots & A_{r-1} & A_r & \cdots
\end{pmatrix} \tag{2}
\]



We require for the applicability of the proposed algorithm that block $A_0$ is
non-singular, and that all rows, except the first one, comprise identical blocks,
as shown in (2). The requirement that $A_0$ is non-singular makes the proposed
method suitable only to a sub-class of problems in the QBD form; however, many
interesting applications belong to this sub-class of problems.

The procedure we propose in this section is aimed at computing $\Pi$ and
provides the exact solution in a finite number of steps. In the next section, we
derive a version of the procedure which is capable of getting both the exact and
a family of approximate solutions with increasing degree of accuracy.

Consider the equation obtained by (1) using the first column of $A$:



ttoBo + ttiAq = 0 



we get 7Ti as: 



TTi = -ttoBoA 



-1 

0 



and we eliminate tti by substituting it into all other equations, obtaining the 
reduced system: 

n(i)A(i) = 0 



where II^^^ = {ni} with i yf 1, and matrix A*^^) has the structure 



Ad) 



Bd) 


By 




b(^) 


0 


Ao 


Ai 




At-_i 


A^ 0 • • • 


0 


Ao 


Ai 




> 

1 

> 

O 




0 


Ao 


Ai 


’ ’ ’ 



The following relations hold: 

I Bd) = B,+1 - BoAq 1A,+1 z = 0 ,l,---r -2 

\ B<y = -BoAq-IA, 



( 3 ) 



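Introducing the notation $B^{(1)} = [B_0^{(1)}\ B_1^{(1)}\ \cdots\ B_{r-1}^{(1)}]$ and
$B^{(0)} = [B_0\ B_1\ \cdots\ B_{r-1}]$ we can re-write (3) as

$$ B^{(1)} = B^{(0)} T \tag{4} $$

where

$$ T = \begin{bmatrix}
-A_0^{-1}A_1 & -A_0^{-1}A_2 & \cdots & -A_0^{-1}A_{r-1} & -A_0^{-1}A_r \\
I & 0 & \cdots & 0 & 0 \\
0 & I & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & I & 0
\end{bmatrix} \tag{5} $$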
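As a quick numerical check of (3) and (4), the following minimal Python sketch builds $T$ from the blocks $A_0, \ldots, A_r$ and compares $B^{(0)} T$ with the right-hand side of (3). The helper name `build_T` and the random test blocks are assumptions made purely for illustration; they are not part of the original procedure.

```python
import numpy as np

def build_T(A):
    """Return the lr x lr matrix T of (4) from the list A = [A_0, ..., A_r].

    First block row: -A_0^{-1} A_1, ..., -A_0^{-1} A_r.
    Identity blocks sit on the first block sub-diagonal; the rest is zero.
    """
    l = A[0].shape[0]
    r = len(A) - 1
    T = np.zeros((l * r, l * r))
    for j in range(r):
        # Solve A_0 X = A_{j+1} rather than forming the inverse explicitly.
        T[:l, j * l:(j + 1) * l] = -np.linalg.solve(A[0], A[j + 1])
    T[l:, :l * (r - 1)] = np.eye(l * (r - 1))
    return T

# Arbitrary test data (hypothetical blocks, only used to exercise the algebra).
rng = np.random.default_rng(0)
l, r = 3, 4
A = [rng.standard_normal((l, l)) for _ in range(r + 1)]   # A_0 .. A_r
B = [rng.standard_normal((l, l)) for _ in range(r)]       # B_0 .. B_{r-1}

T = build_T(A)
B0 = np.hstack(B)                                         # B^(0)
A0inv = np.linalg.inv(A[0])
# Right-hand side of (3), block by block.
B1 = np.hstack([B[i + 1] - B[0] @ A0inv @ A[i + 1] for i in range(r - 1)]
               + [-B[0] @ A0inv @ A[r]])
print(np.allclose(B0 @ T, B1))
```

The check only exercises the block algebra, so the test blocks need not come from a generator matrix; the only requirement is that $A_0$ be non-singular.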



Observe now three key aspects: 

1. Except for the boundary conditions, $A^{(0)}$ and $A^{(1)}$ are equal.

2. The structures expressing the boundary conditions, $B^{(0)}$ and $B^{(1)}$, have the
same dimension.

3. A linear dependence holds between $B^{(0)}$ and $B^{(1)}$.

From these considerations we can derive the further reduced system $A^{(2)}$
from $A^{(1)}$ in the same way as $A^{(1)}$ was derived from $A^{(0)}$, by means of the same
matrix $T$. Repeating this procedure, we get for a general value $i$ the reduced
system $A^{(i)}$ with

$$ B^{(i)} = B^{(i-1)} T = B^{(0)} T^i $$

and

$$ \pi_{i+1} = -\pi_0 B_0^{(i)} A_0^{-1}
            = -\pi_0 B^{(i)} E A_0^{-1}
            = -\pi_0 B^{(0)} T^i E A_0^{-1} \tag{6} $$

where $E$ is an $lr \times l$ matrix which selects $B_0^{(i)}$ from $B^{(i)}$:

$$ E^T = [I, 0, \ldots, 0]. $$

The computation of the term $\pi_0$ in (6) is a major task and it is also the main
objective of the procedure which will be described below.

Observe that in order to get the solution, we can now operate on the fi- 
nite matrix T instead of operating on structures whose dimension is infinite. 
For the following derivations, we need some preliminary considerations on the 
eigenvalues of T, which are given below in: 

Theorem 1. Consider matrix $T$ derived from $A$ as in (5). Under the assumptions
that i) $T$ has all distinct eigenvalues, and ii) $T$ has only one eigenvalue on
the unit circle; $T$ has $l - 1$ eigenvalues strictly outside the unit disc, one
eigenvalue equal to 1 and $l(r - 1)$ eigenvalues strictly inside the unit disc.

The proof is omitted for the sake of brevity. 
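The eigenvalue count of Theorem 1 can be checked numerically on a small instance. The sketch below uses a hypothetical two-phase Markov-modulated single-server queue ($l = 2$, $r = 2$); the rates and all variable names are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

# Hypothetical example: arrivals modulated by a 2-state phase process.
G = np.array([[-1.0, 1.0], [2.0, -2.0]])   # phase generator
Lam = np.diag([1.0, 3.0])                  # arrival rate in each phase
mu = 4.0                                   # service rate
l, r = 2, 2

A0 = mu * np.eye(l)                        # one level down
A1 = G - Lam - mu * np.eye(l)              # same level
A2 = Lam                                   # one level up

# T as in (4): first block row -A_0^{-1} A_j, identity on the sub-diagonal.
T = np.block([[-np.linalg.solve(A0, A1), -np.linalg.solve(A0, A2)],
              [np.eye(l), np.zeros((l, l))]])

mod = np.abs(np.linalg.eigvals(T))
tol = 1e-8
n_out = int(np.sum(mod > 1 + tol))             # strictly outside the unit disc
n_on = int(np.sum(np.abs(mod - 1) <= tol))     # on the unit circle
n_in = int(np.sum(mod < 1 - tol))              # strictly inside the unit disc
print(n_out, n_on, n_in)
```

For these rates the three counts should come out as $l - 1 = 1$, $1$ and $l(r - 1) = 2$, which is the split stated by the theorem.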

Let $\{\lambda_i,\ i = 1, 2, \ldots, lr\}$ be the set of eigenvalues of $T$ and let
$\mathcal{E}_M = \{\lambda_i : |\lambda_i| \ge 1\}$ and $\mathcal{E}_m = \{\lambda_i : |\lambda_i| < 1\}$
be a partition of this set. Denote with $v_i$



Exact and Approximate Solutions for a Class of Infinite Markovian Models 






the eigenvector associated with $\lambda_i$ and partition the set of all eigenvectors in:
$\mathcal{EV}_M = \{v_i : \lambda_i \in \mathcal{E}_M\}$ and $\mathcal{EV}_m = \{v_i : \lambda_i \in \mathcal{E}_m\}$.

As some $\lambda_i$ are not smaller than 1, the powers of $T$, $T^i$, diverge for $i \to \infty$.
Nonetheless, the variables $\pi_i$ derived by these powers (see (6)) tend to 0 for
$i \to \infty$, and the sum of the $\pi_i$'s converges, since we are dealing with ergodic
MCs. We conclude that in (6) the factors multiplying $T^i$ must cancel the growing
components of $T^i$.

We thus search for a linear transformation of $T$ in which we can isolate the
subspace generated by the eigenvectors in $\mathcal{EV}_M$ to which the growth of $T^i$ is
due. Once we have such a transformation, we are able to derive the convergence
conditions by imposing that the coefficients of the diverging terms are null.
As we shall see in the following sections, from these conditions and from the
normalization, we are able to derive a solution for $\pi_0$, and from that obtain the
solution for all other $\pi_i$'s.

Summarizing, two conditions determine the solution:

- ergodicity condition: $\lim_{i \to \infty} \pi_i = 0$

- normalization condition: $\|\Pi\| = \sum_i \|\pi_i\| = 1$, where $\|\cdot\|$ denotes the
  sum of the elements of a vector.

They require respectively that the MC be ergodic and that the solution of the
linear system be a probability vector. In particular, by imposing ergodicity we
get $l - 1$ equations for the $l$ unknowns of $\pi_0$ and, from the normalization, we
obtain another equation.



2.1 Ergodicity 

Let us assume the following transformation of $T$ holds:

$$ T = Q L Q^{-1} \tag{7} $$

such that $L$ has the form:

$$ L = \begin{bmatrix} L_{1,1} & L_{1,2} \\ 0 & L_{2,2} \end{bmatrix} \tag{8} $$

where $L_{1,1}$ is $l \times l$ and $L_{2,2}$ is $l(r-1) \times l(r-1)$, the sets of the
eigenvalues of $L_{1,1}$ and of $L_{2,2}$ are respectively $\mathcal{E}_M$ and $\mathcal{E}_m$, and $Q$
is the transformation matrix.

The transformation of T is crucial for efficiency, and it constitutes the kernel 
of the procedure. We discuss it in detail in the following section. For the moment, 
we derive the solution assuming that a proper transformation has been found. 

We observe:

$$ T^i = Q L^i Q^{-1}
       = Q \begin{bmatrix} L_{1,1}^i & L_{1,2}^{(i)} \\ 0 & L_{2,2}^i \end{bmatrix} Q^{-1}; $$

the blocks on the diagonal are obtained as the powers of blocks $L_{1,1}$ and $L_{2,2}$,
while the block $L_{1,2}^{(i)}$ is a combination of the other two. Therefore, since the
eigenvalues of $L_{1,1}$ and $L_{2,2}$ are respectively $\mathcal{E}_M$ and $\mathcal{E}_m$, we are capable of
identifying in $T^i$ a portion which diverges, given by $[L_{1,1}^i\ L_{1,2}^{(i)}]$, and a
portion which tends to 0 when $i \to \infty$. Observe that the diverging part is
restricted to the upper $l$ rows of $L^i$.

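We now substitute (7) in (6):

$$ \pi_{i+1} = -\pi_0 B^{(0)} T^i E A_0^{-1}
            = -\pi_0 B^{(0)} \left( Q L^i Q^{-1} \right) E A_0^{-1} \tag{9} $$

For the ergodicity condition we have:

$$ \lim_{i \to \infty} \left( -\pi_0 B^{(0)} Q L^i Q^{-1} E A_0^{-1} \right) = 0. $$

Now, since $Q^{-1} E A_0^{-1}$ is finite with full rank, we can write

$$ \lim_{i \to \infty} \left( \pi_0 B^{(0)} Q L^i \right) = 0 \tag{10} $$

From a partition of $Q$ consistent with the partition of $L$, we write $Q L^i$ in (10)
as:

$$ \begin{bmatrix} Q_{1,1} & Q_{1,2} \\ Q_{2,1} & Q_{2,2} \end{bmatrix}
   \begin{bmatrix} L_{1,1}^i & L_{1,2}^{(i)} \\ 0 & L_{2,2}^i \end{bmatrix} $$

where we notice that $Q_{1,1}$ and $Q_{2,1}$ are the only blocks multiplying $L_{1,1}^i$
and $L_{1,2}^{(i)}$.

From this and from the previous considerations, that $L_{2,2}^i$ tends to 0 while
$L_{1,1}^i$ and $L_{1,2}^{(i)}$ diverge as $i$ tends to infinity, we derive that (10) is
equivalent to:

$$ \pi_0 B^{(0)} Q E = 0 \tag{11} $$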

This relation provides a set of equations for the solution of $\pi_0$. We have the
following:

Theorem 2. Equation (11) has $l - 1$ linearly independent equations for the $l$
unknowns of $\pi_0$.

The proof is omitted for the sake of brevity.

Solving (11) we get the steady-state conditional probability of being in the
states of subset $S_0$, given that the MC is within $S_0$; or, equivalently, we get a
vector $\tilde{\pi}_0$ that is a multiple of $\pi_0$:

$$ \pi_0 = \alpha \tilde{\pi}_0 \tag{12} $$



2.2 Normalization 

We now derive $\alpha$ from the normalization condition $\|\Pi\| = 1$:

$$ \left\| \pi_0 - \sum_{i=0}^{\infty} \pi_0 B^{(0)} Q L^i Q^{-1} E A_0^{-1} \right\| = 1 \tag{13} $$

As observed before, $L^i$ is composed of a portion which converges and a portion
which diverges, for $i \to \infty$. The diverging part of $L^i$, $L_{1,1}^i$ and $L_{1,2}^{(i)}$, is
multiplied by elements that, as required by the ergodicity condition (11), are
equal to zero; we can thus conclude that they do not influence the infinite sum
at all. Equation (13) is then equivalent to:

$$ \left\| \pi_0 - \sum_{i=0}^{\infty} \pi_0 B^{(0)} Q M^i Q^{-1} E A_0^{-1} \right\| = 1 $$

where:

$$ M = \begin{bmatrix} 0 & 0 \\ 0 & L_{2,2} \end{bmatrix}. $$

The sum can now be solved:

$$ \left\| \pi_0 - \pi_0 B^{(0)} Q \bar{M} Q^{-1} E A_0^{-1} \right\| = 1 \tag{14} $$

with

$$ \bar{M} = \begin{bmatrix} 0 & 0 \\ 0 & (I - L_{2,2})^{-1} \end{bmatrix} $$

and we obtain:

$$ \alpha \left\| \tilde{\pi}_0 - \tilde{\pi}_0 B^{(0)} Q \bar{M} Q^{-1} E A_0^{-1} \right\| = 1. \tag{15} $$

This result, jointly with (12), gives the solution for $\pi_0$.

From $\pi_0$, applying (6), we get $\pi_i$ up to the desired value of $i$.
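As a sanity check of the ergodicity and normalization steps, consider the simplest possible instance: an M/M/1 queue with arrival rate $\lambda$, service rate $\mu$ and $\rho = \lambda/\mu < 1$ (an illustrative assumption, not an example from this paper). Here $l = 1$ and $r = 2$, and the derivation can be carried out by hand; $\bar{M}$ denotes the block matrix with $(I - L_{2,2})^{-1}$ in its lower-right corner.

```latex
\begin{align*}
B_0 &= -\lambda, \quad B_1 = \lambda, \quad
A_0 = \mu, \quad A_1 = -(\lambda + \mu), \quad A_2 = \lambda, \\
T &= \begin{bmatrix} 1 + \rho & -\rho \\ 1 & 0 \end{bmatrix} = Q L Q^{-1}, \quad
Q = \begin{bmatrix} 1 & \rho \\ 1 & 1 \end{bmatrix}, \quad
L = \begin{bmatrix} 1 & 0 \\ 0 & \rho \end{bmatrix}, \\
B^{(0)} Q &= \begin{bmatrix} 0 & \lambda (1 - \rho) \end{bmatrix}
  \quad\Rightarrow\quad \text{(11) gives } l - 1 = 0 \text{ equations; take } \tilde{\pi}_0 = 1, \\
\bar{M} &= \begin{bmatrix} 0 & 0 \\ 0 & (1 - \rho)^{-1} \end{bmatrix}, \qquad
B^{(0)} Q \bar{M} Q^{-1} E A_0^{-1} = -\frac{\rho}{1 - \rho}, \\
\alpha \left( 1 + \frac{\rho}{1 - \rho} \right) &= 1
  \quad\Rightarrow\quad \alpha = 1 - \rho, \quad \pi_0 = 1 - \rho, \\
\pi_{i+1} &= -\pi_0 B^{(0)} Q L^i Q^{-1} E A_0^{-1} = (1 - \rho)\, \rho^{\,i+1}.
\end{align*}
```

The result is the familiar geometric distribution of the M/M/1 queue, which suggests that the two conditions are being applied consistently.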

Summarizing, the algorithm to obtain the desired exact solution is composed 
of the following steps: 



The algorithm 



1 Operate the transformation of $T$ in (7).

2 Get $\tilde{\pi}_0$ from the ergodicity condition through (11).

3 Get $\alpha$ from the normalization condition through (15).

4 Compute $\pi_0 = \alpha \tilde{\pi}_0$.

5 Get all the $\pi_i$'s up to the maximum desired value for $i$ by applying (6).



The decomposition of $T$ as in step 1 has a key role in the whole procedure;
once we have performed it, the solution can be simply computed. This
decomposition is also the most demanding one in terms of computational cost. In the
following section, we propose a simple and efficient procedure for this crucial
step.
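The five steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: a plain eigendecomposition (`numpy.linalg.eig`) stands in for the transformation (7), which is valid only when $T$ is diagonalisable, and the function name `solve_pi0`, the helper `level` and the example rates are all hypothetical. It is not the Krylov-based procedure of the next section.

```python
import numpy as np

def solve_pi0(B, A):
    """Steps 1-5 with an eigendecomposition standing in for (7).

    B = [B_0, ..., B_{r-1}], A = [A_0, ..., A_r]; all blocks are l x l.
    Returns pi_0 and a function level(i) that gives pi_i for i >= 1.
    """
    l, r = A[0].shape[0], len(A) - 1
    A0inv = np.linalg.inv(A[0])
    T = np.zeros((l * r, l * r))
    T[:l, :] = -A0inv @ np.hstack(A[1:])
    T[l:, :l * (r - 1)] = np.eye(l * (r - 1))
    B0 = np.hstack(B)                              # B^(0), size l x lr

    # Step 1: T = Q L Q^{-1} with L diagonal, eigenvalues sorted by
    # decreasing modulus, so the first l columns of Q carry the growth.
    lam, Q = np.linalg.eig(T)
    order = np.argsort(-np.abs(lam))
    lam, Q = lam[order], Q[:, order]
    Qinv = np.linalg.inv(Q)

    # Step 2: ergodicity condition (11), x (B^(0) Q E) = 0.
    N = B0 @ Q[:, :l]
    x = np.linalg.svd(N.T)[2][-1].conj()           # left null vector of N

    # Step 3: normalisation (15), summing only the decaying modes.
    c = x @ B0 @ Q
    c[:l] = 0.0                                    # zero by (11), up to rounding
    right = Qinv[:, :l] @ A0inv                    # Q^{-1} E A_0^{-1}
    geom = np.zeros(l * r, dtype=complex)
    geom[l:] = 1.0 / (1.0 - lam[l:])               # diagonal of (I - L_22)^{-1}
    alpha = 1.0 / np.sum(x - (c * geom) @ right)

    # Step 4: pi_0 = alpha * pi_0-tilde.
    pi0 = (alpha * x).real

    def level(i):
        """Step 5: pi_i for i >= 1 from (6), decaying modes only."""
        powers = np.zeros(l * r, dtype=complex)
        powers[l:] = lam[l:] ** (i - 1)
        return (-(alpha * c * powers) @ right).real

    return pi0, level

# Hypothetical two-phase Markov-modulated M/M/1 queue (illustrative rates).
G = np.array([[-1.0, 1.0], [2.0, -2.0]])
Lam, mu = np.diag([1.0, 3.0]), 4.0
B = [G - Lam, Lam]
A = [mu * np.eye(2), G - Lam - mu * np.eye(2), Lam]
pi0, level = solve_pi0(B, A)
print(pi0, pi0.sum())
```

Evaluating the levels through the decaying eigenvalues only, rather than through explicit powers of $T$, avoids amplifying rounding errors along the growing modes. For this example the probability of an empty queue, `pi0.sum()`, can be compared with one minus the mean arrival rate divided by the service rate.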

3 The Approximate Solutions 

In the previous sections we formalized the problem and we showed how the
exact solution can be easily obtained once we have a transformation of $T$ as in
(7). In this section we propose a procedure to implement this transformation.
We propose this procedure not only for its simplicity and efficiency, but also
and above all, because it has a very appealing property: while performing the
transformation, a family of approximate solutions may be derived with very low
cost. In fact, the procedure obtains the transformation of $T$ by means of the
computation of a sequence of matrices which are approximations of $T$ in the
sense that their eigenvalues are approximations of those of $T$. An approximate
solution for $\pi_0$ corresponds to each of these matrices, and the more accurate
the estimates of the eigenvalues of $T$ are, the more accurate is the corresponding
estimate of the exact solution.

The use of the procedure is thus twofold. On the one hand it efficiently 
provides the exact solution; on the other hand it can be used to obtain, at even 
lower costs, a set of approximate solutions with increasing accuracy. 

In the following, we first explain the procedure itself, and in particular how 
the transformation is derived by means of the sequence of matrices mentioned 
above. Then, we discuss how to choose one of these matrices in order to have 
a good approximate solution. Finally, we show how the approximate solution is 
derived. 

3.1 The Krylov Subspaces Procedure 

The procedure we propose is capable of providing an approximate solution with
low computational cost. Moreover, with a cost comparable to one of the
techniques mentioned above, it produces the exact solution. It is based on the
construction of an orthogonal basis for $T$ by means of projections on Krylov
subspaces [6].

A Krylov subspace is constructed from the space associated with a matrix
$Q$ by means of the matrix itself and of a randomly chosen vector, let it be $v$:
$\mathcal{K}_m(Q, v) = \mathrm{span}\{v, Qv, Q^2 v, \ldots, Q^m v\}$; the parameter $m$ determines the
dimension of the subspace. We work with the Krylov subspaces of $T$:

$$ \mathcal{K}_m(T, v_1) = \mathrm{span}\{v_1, T v_1, T^2 v_1, \ldots, T^m v_1\} \tag{16} $$

fixing a random vector $v_1$ and letting $m$ vary. The choice for $v_1$ influences the
performance of the procedure we are going to present; however, this influence is
still under investigation.

Given a Krylov subspace $\mathcal{K}_m$ as in (16) we use an algorithm due to Arnoldi
[7] to build a basis for it. The algorithm gets the basis for $\mathcal{K}_m$ in $m$ steps.
It can be shown that the following relation holds:

$$ V_m^T T V_m = H_m \tag{17} $$

where $H_m$ is an $(m + 1) \times (m + 1)$ upper Hessenberg matrix, and $V_m$ is a
$lr \times (m + 1)$ matrix.

The Krylov subspace (KS) projection method is given below. We denote with
$H_m(i, j)$ the element $(i, j)$ of $H_m$, with $v_j$ the $j$-th column of $V_m$ and with $w_j$
the $j$-th column of a temporary matrix used along the procedure.

KS Projection method



1 for $j = 1, 2, \ldots, m$ do

2   $w_j = T v_j$

3   for $i = 1, 2, \ldots, j$ do

4     $H_m(i, j) = w_j^T \cdot v_i$

5     $w_j = w_j - H_m(i, j)\, v_i$ end

6   $H_m(j + 1, j) = \|w_j\|_2$

7   $v_{j+1} = w_j / H_m(j + 1, j)$ end
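The KS projection listing above is essentially the Arnoldi iteration with modified Gram-Schmidt orthogonalisation. A minimal NumPy sketch follows; the function name `arnoldi`, the breakdown tolerance and the random test matrix are assumptions made for illustration. The parameter `k` is the number of basis vectors, i.e. $k = m + 1$ in the notation of (17).

```python
import numpy as np

def arnoldi(T, v1, k):
    """Build k orthonormal Krylov vectors of T and H = V^T T V (k x k).

    Returns V (n x k) and the upper Hessenberg matrix H. If the Krylov
    subspace becomes invariant before k vectors are found (breakdown),
    the smaller basis computed so far is returned.
    """
    n = T.shape[0]
    V = np.zeros((n, k))
    H = np.zeros((k, k))
    V[:, 0] = v1 / np.linalg.norm(v1)
    for j in range(k):
        w = T @ V[:, j]
        for i in range(j + 1):
            H[i, j] = w @ V[:, i]          # projection on the i-th basis vector
            w = w - H[i, j] * V[:, i]      # remove that component
        if j + 1 < k:
            h = np.linalg.norm(w)
            if h < 1e-12:                  # invariant subspace reached
                return V[:, :j + 1], H[:j + 1, :j + 1]
            H[j + 1, j] = h
            V[:, j + 1] = w / h
    return V, H

# Arbitrary test matrix, used only to exercise relation (17).
rng = np.random.default_rng(1)
T = rng.standard_normal((8, 8))
V, H = arnoldi(T, rng.standard_normal(8), 5)
print(np.allclose(V.T @ V, np.eye(5)), np.allclose(V.T @ T @ V, H))
```

The two printed checks correspond to the orthonormality of the basis and to relation (17); both are expected to hold up to rounding as long as no breakdown occurs.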



A very interesting property of this basis is that the $m + 1$ eigenvalues of
$H_m$ turn out to be good approximations of the $(m + 1)$ largest eigenvalues of $T$.
Since we need for our solution the $l$ largest eigenvalues of $T$, we exploit
this property and work on $H_m$ instead of working directly on $T$. The advantage
consists in the reduction of the computational cost, which is mainly due to two
factors:

- $H_m$ has smaller dimension than $T$

- $H_m$ has an upper Hessenberg structure.

We pay for this improvement in efficiency with a loss in the accuracy of the
eigenvalues we operate with. This trade-off is governed by the value of $m$. The
larger $m$ is, the closer the eigenvalues of $H_m$ are to those of $T$, but the higher
the computational cost.

The efficiency in getting the set of approximate solutions derives from the
incremental way in which the basis construction proceeds. Without losing the
computations which have already been performed, $V_{m+s}$ and $H_{m+s}$ can be
obtained by concatenating to the already existing $V_m$ and $H_m$ some new rows and
columns, computed with $s$ additional steps of the above procedure. In particular
$V_{m+s}$ is built by concatenating $s$ more columns to the already existing $V_m$, and
$H_{m+s}$ is built by concatenating $s$ more rows and columns to $H_m$. Fig. 1
illustrates this property; the new rows and columns computed during the $s$ additional
steps are shaded in the figure.

3.2 Finding a Proper Value for m 

Letting $m$ grow, we get a sequence of matrices $H_m$ and $V_m$ with growing
dimension. Correspondingly, the eigenvalues of $H_m$, $\lambda_i^{(m)}$ with
$i = 1, 2, \ldots, m + 1$, get closer to those of $T$. For our purposes, we need
estimates for the $l$ largest eigenvalues of $T$, and therefore the minimum value
for $m$ that we are going to consider is $m_0 = l$.

For $m = lr - 1$, $H_m$ is a $lr \times lr$ matrix and it is similar to $T$, $V_m$ is an
orthogonal basis of $T$ and the $\lambda_i^{(m)}$ coincide with the eigenvalues of $T$. We say
that $m_M = lr - 1$ is the maximum value that $m$ may assume, and we use the
simplified notation: $H_{m_M} = H$, $V_{m_M} = V$.

Fig. 1. Incremental construction of $H_{m+s}$ and $V_{m+s}$ from $H_m$ and $V_m$

Within this range of values, we need to select a proper $m$ which
balances the opposite needs of high accuracy of the results and low cost of
computations. We proceed as follows.

We initially set $m = m_0$ and we run the KS procedure; we then stop to
check the achieved approximation and decide whether the chosen value of $m$ is
acceptable, or a larger one should be used. In the latter case we continue the KS
projection method for $s$ more steps and check again.

The criterion to evaluate how close the eigenvalues of $H_m$ are to the ones of $T$
has to be simple and computationally inexpensive in order for the procedure to
be efficient. For this reason we use a procedure which avoids the explicit
computation of the eigenvalues of $H_m$. We exploit, instead, the knowledge that
$\lambda_l = 1$ for $T$, and that the associated eigenvector is $v_l = [1\ 1\ \cdots\ 1]^T$.
Correspondingly, for $H_m$, we have the eigenvalue $\lambda_l^{(m)}$ with the associated
eigenvector $w_l^{(m)}$, and they respectively tend, as $m$ gets closer to $m_M$, to 1 and
to a vector $w_l$. $w_l$ is easily derived from $v_l$ and from $V$ (remember that
$V = V_m$ when $m = m_M$) as:

$$ w_l = V^T v_l \tag{18} $$

In fact, when $m = m_M$, we have:

$$ T = V H V^T $$
$$ T v_l = \lambda_l v_l = v_l $$
$$ T v_l = V H V^T v_l = v_l $$
$$ H V^T v_l = V^T v_l $$

from this last equation we notice that $V^T v_l$ is the eigenvector of $H$ associated
with $\lambda_l = 1$.

The adopted accuracy check criterion derives from the definition of eigenvalue
and from the above considerations that $\lambda_l^{(m)}$ and $w_l^{(m)}$ tend to 1 and to $w_l$ as
$m$ grows. For a given $m$, the following relation holds:

$$ \left( H_m - \lambda_l^{(m)} I \right) w_l^{(m)} = 0 \tag{19} $$










Approximating $\lambda_l^{(m)}$ with $\lambda_l = 1$, and $w_l^{(m)}$ with $V_m^T v_l = z^{(m)}$, we get:

$$ H_m z^{(m)} - z^{(m)} \approx 0 \tag{20} $$

as $m$ approaches $m_M$. Notice that with respect to (19) two sources of error are
introduced: both the eigenvalue ($\lambda_l^{(m)}$) and the eigenvector ($w_l^{(m)}$) of $H_m$ are
approximated. Equation (20), in fact, holds exactly for $m = m_M$ only, but not
for $m < m_M$.

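Using the succession in the index $m$ of the vectors $z^{(m)}$ and $\lambda_l = 1$, we choose
a proper value for $m$ by imposing that a precision level $P$ is reached in (20). The
precision level $P$ may be evaluated according to different definitions of error; the
one we use here is the square error defined as follows. Let $z^{(m)} = V_m^T v_l$; we
have:

$$ \mathrm{err}\left( z^{(m)} - H_m z^{(m)} \right)
   = \sum_{j=1}^{m+1} \left( z_j^{(m)} - \left[ H_m z^{(m)} \right]_j \right)^2 \tag{21} $$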



We can now specialize the procedure for the computation of m. 



Computation of m 



1 set $m = m_0$ and $i = 1$

2 apply the KS projection method from $i$ to $m$

3 compute $z^{(m)} = V_m^T v_l$

4 derive $H_m z^{(m)}$

5 if $\mathrm{err}(z^{(m)} - H_m z^{(m)}) > P$ then $i = m + 1$, $m = m + s$ and goto step 2

3.3 Getting the Approximate Solution 

Once we have $H_m$ and $V_m$ with a properly chosen value for $m$, we still do not
have a transformation as required in (7). However, the small dimension of $H_m$
and, above all, its Hessenberg structure give great advantages in finding such a
transformation for $H_m$ instead of $T$.

Let $Q$ be a transformation for $H_m$ analogous to that for $T$ in (7):

$$ H_m = Q K Q^{-1} \tag{22} $$

$$ K = \begin{bmatrix} K_{1,1} & K_{1,2} \\ 0 & K_{2,2} \end{bmatrix} \tag{23} $$

where $K_{1,1}$ is an $l \times l$ block and $K_{2,2}$ is $(m + 1 - l) \times (m + 1 - l)$, and where the
eigenvalues of $K_{1,1}$ are the approximations of those in $\mathcal{E}_M$. This transformation
can be done with a traditional approach such as the QR decomposition.
Usually these methods greatly benefit from the Hessenberg structure of the matrix
they are dealing with.

Substituting (22) in (17) we get:

$$ V_m^T T V_m = Q K Q^{-1} $$
$$ T V_m = V_m Q K Q^{-1} $$

and we can easily derive $\tilde{\pi}_0$ from the ergodicity condition (11):

$$ \tilde{\pi}_0 B^{(0)} V_m Q E = 0 \tag{24} $$

Notice that the experimental results presented in the next section show that the
approximation introduced by using $H_m$ provides very accurate results even for
small values of $m$.

Up to now we focused on the computation of the eigenvalues of $T$ in $\mathcal{E}_M$, from
which we can get $\tilde{\pi}_0$. In order to have the normalization constant $\alpha$, we need the
portion of the space associated with $T$ which is generated by the eigenvectors in
$\mathcal{EV}_m$.


The procedure we adopt consists in extracting from the space of $T$, one at a
time, the subspaces generated by each eigenvector in $\mathcal{EV}_M$, so that the partition
which results at the end collects the subspace generated by $\mathcal{EV}_M$ on one side
and what remains on the other. In this way, we decompose the matrix by using
the eigenvalues in $\mathcal{E}_M$ and the corresponding eigenvectors in $\mathcal{EV}_M$ only; this
procedure is called deflation method [Sj. 

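Let $v_1 \in \mathcal{EV}_M$ be the eigenvector associated with eigenvalue $\lambda_1 \in \mathcal{E}_M$. We
transform $T$ in:

$$ \begin{bmatrix} \lambda_1 & b_1 \\ 0 & T^{(1)} \end{bmatrix} $$

where $T^{(1)}$ is $(lr - 1) \times (lr - 1)$ and $b_1$ is $1 \times (lr - 1)$. It can be shown that the
transformation matrix $P_1$ such that:

$$ P_1 T P_1^{-1} = \begin{bmatrix} \lambda_1 & b_1 \\ 0 & T^{(1)} \end{bmatrix} $$

is given by the Householder matrix [2]

$$ P_1 = I - 2 w_1 w_1^T $$

where $w_1 = c(v_1 - e_1)$, with $e_1^T = [1, 0, \ldots, 0]$ and $c$ is such that $\|w_1\|_2 = 1$.
Notice that $T^{(1)}$ has the same eigenvalues as $T$ except for $\lambda_1$.

We then iterate this procedure and apply it to $T^{(1)}$ for the eigenvalue
$\lambda_2 \in \{\mathcal{E}_M - \lambda_1\}$. We derive $w_2$ from the eigenvector of $T^{(1)}$ which is obtained
from that of $T$ as $P_1 v_2$.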

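Once we have repeatedly applied the procedure for all the eigenvalues in $\mathcal{E}_M$
we have transformed $T$ in:

$$ \begin{bmatrix}
\lambda_1 & \times & \cdots & \times & \times \\
0 & \lambda_2 & \cdots & \times & \times \\
\vdots & & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \lambda_l & \times \\
0 & 0 & \cdots & 0 & T^{(l)}
\end{bmatrix} $$

which is in the form of (8) with $L_{1,1}$ upper triangular (with $\lambda_i$ along the main
diagonal) and $L_{2,2} = T^{(l)}$.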

The intermediate matrices $T^{(i)}$ do not have to be computed. The overall
transformation is in fact given by $P = P_l P_{l-1} \cdots P_1$ and the product can be
updated at step $i$ with little cost without knowing the matrices $T^{(j)}$ with $j < i$.
Let $Q_i = P_i \cdots P_1$, we have

$$ Q_{i+1} = P_{i+1} Q_i
          = (I - 2 w_{i+1} w_{i+1}^T) Q_i
          = Q_i - 2 w_{i+1} w_{i+1}^T Q_i $$

Summarizing, we can write the following procedure: 



Deflation method 



1 $P = I$

2 for $i = 1, 2, \ldots, l$ do

3   $w_i = v_i - e_i$

4   $w_i = w_i / \|w_i\|_2$

5   $P = P - 2 w_i w_i^T P$



At the end of the procedure we have $P T P^{-1} = L$ where $L$ is as in (8) and
$P = Q^{-1}$.
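The deflation listing above can be sketched with Householder reflectors in NumPy. The code is a minimal illustration under stated assumptions: the eigenpairs being deflated are real, and at step $i$ the reflector is built from the trailing part of $P v_i$ (following the remark that the eigenvector of $T^{(1)}$ is obtained as $P_1 v_2$). The function name `deflate` and the example matrix are hypothetical.

```python
import numpy as np

def deflate(T, eigvecs):
    """Accumulate P = P_k ... P_1 so that P T P^T is upper triangular
    in its first k columns, given k real eigenvectors of T."""
    n = T.shape[0]
    P = np.eye(n)
    for i, v in enumerate(eigvecs):
        u = (P @ v)[i:]                  # eigenvector of the trailing block
        u = u / np.linalg.norm(u)
        w = np.zeros(n)
        w[i:] = u
        w[i] -= 1.0                      # w = u - e_1 in the trailing space
        norm_w = np.linalg.norm(w)
        if norm_w < 1e-14:               # already aligned, nothing to do
            continue
        w /= norm_w
        P = P - 2.0 * np.outer(w, w @ P) # P <- (I - 2 w w^T) P
    return P

# Hypothetical example with l = 2, r = 2 (illustrative rates).
G = np.array([[-1.0, 1.0], [2.0, -2.0]])
Lam, mu = np.diag([1.0, 3.0]), 4.0
A0, A1, A2 = mu * np.eye(2), G - Lam - mu * np.eye(2), Lam
T = np.block([[-np.linalg.solve(A0, A1), -np.linalg.solve(A0, A2)],
              [np.eye(2), np.zeros((2, 2))]])

lam, X = np.linalg.eig(T)
order = np.argsort(-np.abs(lam))
lam, X = lam[order].real, X[:, order].real   # eigenpairs are real for this T
P = deflate(T, [X[:, 0], X[:, 1]])           # deflate the l largest eigenvalues
L = P @ T @ P.T                              # P is orthogonal, so P^{-1} = P^T
print(np.round(L, 6))
```

Because each reflector is orthogonal and symmetric, the inverse of the accumulated $P$ is simply its transpose, which keeps the update in step 5 of the listing cheap.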

The deflation method we showed above requires the eigenvectors in $\mathcal{EV}_M$. We
could get them from $T$ and the estimates of its eigenvalues which are provided by
$H_m$, $\lambda_i^{(m)}$, through equation $(T - \lambda_i I) v_i = 0$. Since this is rather costly in
terms of required computations, we choose a slightly less accurate approach which
further reduces the cost of the procedure. We directly approximate the eigenvectors
$v_i$ with the columns of $V_m Q$; these, in fact, in a similar way as the eigenvectors,
give the transformation for $T$.

The computational cost of the procedure is briefly outlined as follows. The KS
method has complexity $O(lr\, m^2)$ and the most costly operation that follows KS
is the transformation of $H_m$, which benefits both from the Hessenberg structure
and from the small dimension of $H_m$. Adopting, for example, a standard QR
decomposition the complexity is $O(m^3)$ against a complexity $O((lr)^3)$ for the
same operation on matrix $T$.

4 An Example 

The purpose of this section is to evaluate, in a practical case of interest, how the 
approximate solution obtained with the proposed procedure may influence the 
results of the performance evaluation of a system. 

We consider a system with two single-server queues, shown in Fig. 2. The first
queue has finite capacity $B_1 = 10$; the second has infinite capacity. Customers
arrive in groups which may comprise up to $g$ customers. When a group arrives
at the first queue and there is not enough space to accommodate all its clients,
the whole group moves to the second queue. Interarrival times for groups, like
service times, are random variables with negative exponential distribution and
rates at the two queues respectively equal to $\lambda_1$, $\lambda_2$ and $\mu_1$, $\mu_2$.

Fig. 2. System with two queues

The state variable for this model is $(n_1, n_2)$ where $n_i$ is the number of
customers in queue $i$, with $n_1 = 0, 1, \ldots, B_1$ and $n_2 \ge 0$. The block dimension is
$l = B_1 + 1 = 11$, while the maximum group size determines the band extension
$r$, $r = g + 1$.

We choose as a typical performance index for this system the average number 
of customers in the system and we evaluate the impact of the approximate 
procedure on this index under different working conditions. 

We first consider the case $g = 1$ ($r = 2$) corresponding to individual customer
arrivals. We let the load vary in the system by changing the rate at which
customers arrive at the first queue: $\lambda_1$ grows from 10 customers per second up
to 18. The other parameters are constant: $\lambda_2 = 1\,\mathrm{s}^{-1}$, $\mu_1 = 20\,\mathrm{s}^{-1}$ and
$\mu_2 = 2\,\mathrm{s}^{-1}$.

Left plots in Fig. 3 show $E[N]$, the average number of customers in the
system (considering both queues), versus the total arrival rate, $\lambda_1 + \lambda_2$, which
is the total number of customers arriving at the system per time unit. The
traffic intensity at the first queue, given by $\rho_1 = \lambda_1 / \mu_1$, varies between 0.5 and
0.9. The traffic intensity at the second queue, $\rho_2$, is obtained by the ratio of the
total arrival rate (considering both customers which arrive directly at the second
queue and customers which first try to enter the first queue and, finding it full,
move to the second queue) and the service rate $\mu_2$; $\rho_2$ varies between 0.5 and
0.95. Using different values for $m$ in the procedure we get curves with different
degrees of accuracy. The minimum value for $m$ is $l = 11$, while the maximum
value, $m_M = lr - 1 = 21$, yields the exact solution. The figure shows that even for
small values of $m$ ($m = 14$) the approximate results are almost indistinguishable
from the exact ones.

Right plots in Fig. 3 show the error computed as in (21) and used for the
accuracy check versus increasing values of $m$. The dashed line refers to a high
value of the traffic load (the total arrival rate equals 19, corresponding to
$\rho_1 = 0.9$ and $\rho_2 = 0.95$); the solid line refers to moderately high traffic load (the
total arrival rate equals 15, corresponding to $\rho_1 = 0.7$ and $\rho_2 = 0.56$). When fixing
a level of accuracy $P = 0.01$, the procedure selects a value of $m$ equal to 16 for both
traffic values; notice from Fig. 3 that for $m = 15$ the performance predictions
are already very accurate. For $P = 0.0001$ the required value for $m$ is 17.

Fig. 3. $E[N]$ versus the total arrival rate for different values of $m$ (left plots)
and error versus $m$ for total arrival rate equal to 15 and 19 (right plots); with
$r = 2$

We consider now the case of variable customer group size; each group has a
number of customers between $g = 1$ and $g = 4$ with equal probabilities. The band
extension in $A$ is thus equal to 5 and matrix $T$ has dimension $lr = 55$. We show
for this case similar results as those presented for $g = 1$.

In the left part of Fig. 4 we plot $E[N]$ versus the total arrival rate of groups for
different values of $m$. The traffic intensity at the first queue, $\rho_1$, varies between
0.5 and 0.75; at the second queue it varies between 0.58 and 0.89. In this case
curves become indistinguishable for $m$ about 20. The procedure for the choice
of $m$ selects $m = 20$ for $P = 0.01$ and $m = 24$ for $P = 0.0001$. Finally, in the
right plots in Fig. 4 we set the arrival rate to $\lambda_1 = 15$, $\lambda_2 = 1$ (i.e. $\rho_1 = 0.7$ and
$\rho_2 = 0.8$) and we plot $E[N]$ versus the dimension of $H_m$. We observe that for $m$
around 20 the approximate results are almost identical to the exact ones.

5 Conclusions 

The algorithm for the computation of the equilibrium solution of infinite MCs
that was newly proposed in this paper derives mainly from the observation of
the existence of a linear dependence among the problem unknowns, and, in
particular, of the dependence of all the $\pi_i$'s on $\pi_0$. Realizing then that this
dependence can be expressed by means of a finite matrix ($T$), the problem,
whose dimension is infinite, can be reduced to a finite one.

Most of the approaches that were previously described in the literature solve
the infinite problem by finding some iterative relations among the unknowns,
and then solving these relations by means of iterative procedures. Instead, we
adopt quite a different approach: we find the linear relation among variables
and derive the solution by imposing limit conditions which lead to simple direct
equations.

Fig. 4. $E[N]$ versus the total arrival rate for different values of $m$ (left plots)
and $E[N]$ versus the size of $H_m$, $m + 1$, for total arrival rate equal to 15 (right
plots); with $r = 5$

The proposed approach leads to a very efficient procedure for the computa- 
tion of the solution. The procedure provides at low cost both the exact equi- 
librium solution and a family of approximate solutions with growing degree of 
accuracy. 

Tests on a case study consisting in a model of a two-queue system show
that the approach is quite promising, being efficient and accurate, even when
providing approximate results.

Abstract. We consider the performance of a distributed, three-tier,
client-server architecture, typical for large, Java-supported, Internet ap- 
plications. An analytical model is developed for the central schedulers 
in such systems, which can be applied at various levels in a hierarchical 
modelling approach. The system involves a form of blocking in which 
clients must wait for one of a number of parallel ‘instance servers’ to 
clear its outstanding work in order that a new instance may be ac- 
tivated. Thus, blocking time is the minimum of sojourn times at the 
parallel queues. We solve this model for the probability distribution of 
blocking time and obtain a simple formula for its mean value. We then 
use this result in a flow-equivalent server model of the whole system and
compare our approximate results with simulation data. This numerical 
validation indicates good accuracy for the blocking approach per se as 
well as for system throughput, the performance objective chosen for the 
exercise. 



1 Introduction 

Distributed client-server systems, such as data mining systems, have become so 
complex that they cannot be designed efficiently without the use of some form of 
quantitative (performance) model; engineering experience alone is now too unre- 
liable. Simulation can sometimes be used for this purpose but analytical models 
are preferred where possible in view of their greater efficiency and flexibility for 
experimentation. They make approximations in order to obtain tractable solu- 
tions and can be validated initially, in simple applications, by simulation models. 
In this paper, we develop an analytical model for a distributed, three-tier, client- 
server architecture, typical for large, Java-supported, Internet applications. In 
particular, we focus on the central schedulers in such systems, giving sub-models 
that can be applied at various levels in a hierarchical modelling approach. 

The Kensington Enterprise Data Mining system is just such an architecture, 
whose application server is the critical performance component. The aim of the 
present research is the performance modelling and evaluation of this system and
specifically focuses on the behaviour of the application server. Data mining, or
knowledge discovery in databases, is concerned with extracting useful new in- 
formation from data, and provides the basis for leveraging investments in data 
assets. It combines the fields of databases and data warehousing with algorithms 
from machine learning and methods from statistics to gain insight into hidden 
structures within the data. Data mining systems for enterprises and large or- 
ganisations have to overcome unique challenges. They need to combine access 
to diverse and distributed data sources with the large computational power re- 
quired for many mining tasks. In large organisations, data from numerous sources 
need to be accessed and combined to provide comprehensive analyses, and work 
groups of analysts require access to the same data and results. For this purpose, 
the existing networking infrastructure, typically based on Internet technology, 
becomes a key issue. 

The requirements for enterprise data mining are high-performance servers,
the ability to distribute applications and the capacity to provide multiple access
points. To fulfil these requirements, Kensington Enterprise Data Mining
embodies a three-tier client-server architecture which encapsulates the main
application functions inside an application server that can be accessed from clients on
the network. The three tiers are client, application server and database servers,
as shown in Fig. 1. In addition, since enterprise data mining environments require
flexibility and extensibility, the Kensington solution uses distributed component
technology.

Under this paradigm, the system has been designed using the Enterprise 
JavaBeans (EJB) component architecture and has been implemented in Java. 
Databases anywhere on the Internet can be accessed via Java Database Connec- 
tivity (JDBC) connections and the application executes in the Kensington EJB 
server, an EJB architecture implementation. 

The research carried out so far investigates one of the server functionalities,
which is the entity method execution call. The objective is to analytically model
the method call behaviour in order to predict its performance, e.g. its throughput
or mean response time. To study the behaviour of a distributed system, a model
involving a number of service centres with jobs arriving and circulating among
them according to some pattern is normally needed; typically a queueing network
is appropriate. Therefore, the first stage of the investigation is to analyse the
method call execution process in terms of queueing network concepts. Following
this, the entity method execution call process is simulated and the results thus
obtained used to validate the analytical model.

Fig. 1. Kensington Enterprise Data Mining Architecture

The rest of this paper is organised as follows. Section 2 describes the
Kensington EJB server and its characteristics as an implementation of the EJB
architecture. Section 3 explains the entity method call execution process and its
analytical model is developed in Section 4. Numerical results are presented in
Section 5 and the paper concludes in Section 6, also considering future directions
for investigation.

2 Kensington EJB Server 

The Kensington EJB server fully implements the EJB-1.0 specification [S] and 
also supports entity beans (defined in the draft EJB-1.1 standard). The EJB
architecture is a new component architecture, created by Sun, for the development
and deployment of object-oriented, distributed, enterprise-level applications.



2.1 Enterprise JavaBeans 

Enterprise JavaBeans is an architecture for component-based distributed com- 
puting. Enterprise Beans are components of distributed transaction-oriented en- 
terprise applications. Each component (or bean) lives in a container. Transpar- 
ently to the application developer, the container provides security, concurrency, 
transactions and swapping to secondary storage for the component. Since a bean 
instance is created and managed at runtime by a container, a client can only 
access a bean instance through its container. 

EJB defines two kinds of components: session beans and entity beans. While 
session beans are lightweight, relatively short lived, do not survive server crashes 
and execute on behalf of a single client, entity beans are robust, expected to 
exist for a considerable amount of time, do survive server crashes and are shared 
between different users. 

A bean implementation and two interfaces, the bean interface and the home
interface, define a bean class. A client accesses a bean through the bean interface,
which defines the business methods that are callable by clients. The home
interface allows the client to create (and, in case of entity beans, look up) a bean.

From the bean implementation and the interfaces, two new classes are automatically
generated using the container facilities: a bean class that will wrap
the actual bean (Container Generated Bean or CGBean class) and a home class
that allows a user to create a bean (Container Generated Home or CGHome
class). Fig. 2 shows the class structure.

Fig. 2. Class generation using container facilities

2.2 EJB Server Implementation 

In the Kensington EJB implementation, each container is responsible for manag- 
ing one EJB. Thus, there is one container instance (from now on simply referred 
to as a container) for each bean class. Fig. 3 shows how a client uses different
bean instances of the same bean class. 

The communication between clients and EJB server is done through Remote 
Method Invocations (RMI), which is a protocol allowing Java objects to communicate
[3]. A method dispatched by the RMI runtime system to a server may
or may not execute in a separate thread. Some calls originating from the same 
client virtual machine will execute in the same thread and some will execute 
in different threads. However, calls originating from different client virtual ma- 
chines always execute in different threads. Therefore there is at least one thread 
for each client virtual machine. 

The Kensington EJB Server and container implementations use Java synchronised
statements and methods to guarantee consistency when objects (i.e.
bean instances and containers) are concurrently accessed. Because
synchronised statements and methods have an important influence on system
performance, their detailed behaviour is explained below in Section 2.2 (Java
Synchronisation). Similarly, the characteristics and behaviour of entity beans are
described in more detail in Section 2.2 (Entity Bean Implementation).

Fig. 3. Client use of several bean instances



Java Synchronisation. In order to synchronise threads, Java uses monitors
[2], which are high-level mechanisms allowing only one thread at a time to execute
a region of code protected by the monitor. The behaviour of monitors is
explained in terms of locks; there is a lock associated with each object. There are 
two different types of synchronisation: synchronised statements and synchronised 
methods. 

A synchronised statement performs two special actions: 

1. After computing a reference to an object, but before executing its body, it 
sets a lock associated with the object. 

2. After execution of the body has completed, either normally or abortively, it 
unlocks that same lock. 

A synchronised method automatically performs a lock action when it is invoked. 
Its body is not executed until the lock action has successfully completed. If the 
method is an instance method, the lock is associated with the object for which 
it was invoked (that is, the object that will be known as this during execution of 
the body of the method). For a class method, i.e. if the method is static, the lock 
is associated with the class object that represents the class in which the method 
is defined. If execution of the method’s body is ever completed, either normally 
or abortively, an unlock action is automatically performed on the acquired lock. 



Entity Bean Implementation. An entity container has a limited number of 
bean instances existing concurrently, i.e. there is a maximum number of active 
bean instances (i.e. instances in main memory ready to be accessed) for a bean 
class. 

Multiple clients can access an entity instance concurrently. In this case, the 
container synchronises its access. Each client uses a different instance of the 
CGBean class that interfaces the client with the container but all the clients 
share the same bean instance. As shown in Fig. 4, when two clients access an
instance concurrently, they share the CGHome instance (because there is only
one of them for all the clients) but they use different CGBean instances.

3 Method Call Execution 

Method invocations are made through the container in order to perform transparent
user authentication and to allocate the necessary resources to execute
the methods. As stated already, bean instances of the same bean class share the
container. As a consequence, threads (or clients) using either the CGHome or
CGBean instances of the same bean class share the lock for its container. The
lock is requested when these objects invoke a container synchronised method
since only one of the synchronised methods for each container can be executed
at the same time. In addition, CGBean instances use the thread manager which
has some synchronised methods. Therefore, all the CGBean instances share a
resource, which is the thread manager lock.

Fig. 4. Concurrent access from two clients

There is a limit to the number of method calls that can be concurrently 
executing. Hence, when the method call number reaches this maximum, the 
CGBean requesting a thread to carry out a method execution will wait until one 
of the executing methods finishes. Similarly, clients who share a bean instance 
share the lock associated with it when they use synchronised statements. The 
interaction diagram for a method execution call is shown in Fig. 5.



[Figure: interaction between the CGBean, Thread Manager, Container and Bean
Instance objects. Annotated steps: 1: if instance is not active; 2: until some
instance not used; 3: PassivateInstance (not used found), ActivateInstance (new one).]

Fig. 5. Interaction diagram for a Method Execution Call



4 An Extended Queueing Model



Based on the behaviour explained above, a queueing network model is shown
in Fig. 6. The queueing network consists of 1 + C + C·M stations, where 1
corresponds to the thread manager station, C is the number of containers in
the system (i.e. the number of different bean classes) and M is the maximum
number of (different) bean instances for a bean class that can be active at the
same time. The particular set of active instances for any class, associated with
a bean container and executing on up to M servers in the model, will vary over
time according to the demands of the tasks departing from the container server.
We assume that switch-over time between active instances at the parallel servers
is negligible and accounted for in the container’s service times. The ‘waiting set’
(refer to Fig. 6) comprises threads not admitted to the whole EJB server network,
which also has a concurrency limit, N. As in traditional multi-access system
models (see, for example, [3]), we solve for the performance of the whole closed
queueing network, with the waiting set and all departures removed, at various
populations N.

For mathematical tractability and the desire for an efficient (approximate) 
solution, we assume all service times are exponential random variables, that the 
queueing disciplines are FCFS and that routing probabilities are constant and 
equal across each of the C bean containers. The M bean instances attached 
to each bean container are also equally utilised overall, but the specific routing 
probabilities in each network-state depend on the blocking properties, which are 
described below. 

To simplify this system, we apply the Flow Equivalent Server method (FES) 
12, which reduces the number of nodes by aggregating sub-networks into single, 
more complex (i.e. queue length dependent) nodes. Applying this method to our 
system, each FES sub-network consists of M + 1 stations where 1 corresponds
to the container for a bean class and M is as above. After short-circuiting, this
sub-network results in the closed one shown in Fig. 7, which will be analysed to
obtain its throughput. 

The next step is to obtain the analytical model corresponding to the FES 
sub-network in order to determine the service rate function for a FES node 
in the overall network. Blocking is a critical non-standard characteristic in the 
FES sub-network; a client who has completed service in the container station is 
blocked if the required bean instance is not active and there is no free instance 
to passivate. As a consequence, the blocking time needs to be calculated. In
conventional blocking models, see for example [7], blocking time is normally
equal to the residual service time at some downstream server. Here, however, it 
is the time required for the first of the M parallel servers to clear its queue in a 
blocking-after-service discipline.

The sub-model corresponding to the FES sub-network is appropriate to rep- 
resent the behaviour of various functionalities of this system — and others — 
at various levels in the modelling hierarchy. 

4.1 Blocking Time 

Given that a customer is blocked at the container server, so that all the instance
servers have non-empty queues, the blocking time B is the minimum of the times
for each of the M queues to become empty, with no further arrivals during the
time B. Let the random variable $T_i$ denote the time for queue i to become empty,
given that initially (i.e. when the customer is first blocked) there are $n_i \geq 1$
other customers in queue i, including the one in service ($1 \leq i \leq M$). Then the
probability distribution function, $F_{n_i}(t)$, of $T_i$ is Erlang-$n_i$ with parameter $\mu_i$,
i.e.

$$F_{n_i}(t) = P_{n_i}(T_i \leq t) = 1 - \sum_{k=0}^{n_i-1} \frac{(\mu_i t)^k}{k!}\, e^{-\mu_i t} \tag{1}$$

Now, the conditional (on queue lengths $\mathbf{n} = (n_1, \ldots, n_M)$) blocking time
$T_{\mathbf{n}} = \min_{1 \leq i \leq M} T_i$ has complementary cumulative probability distribution function

$$P_{\mathbf{n}}(T_{\mathbf{n}} > t) = \prod_{i=1}^{M} P_{n_i}(T_i > t) = \prod_{i=1}^{M} \sum_{k=0}^{n_i-1} \frac{(\mu_i t)^k}{k!}\, e^{-\mu_i t} \tag{2}$$

since the $T_i$ are independent. Hence the unconditional blocking time has distribution
function B(t) given by

$$B(t) = 1 - \sum_{\mathbf{n}:\ \forall i.\, n_i > 0} \pi(\mathbf{n}) \prod_{i=1}^{M} \sum_{k=0}^{n_i-1} \frac{(\mu_i t)^k}{k!}\, e^{-\mu_i t} \tag{3}$$

Fig. 6. Global queueing network for a method execution



where $\pi(\mathbf{n})$ is the equilibrium probability that there are $n_i$ customers in queue i
at the instant a customer becomes blocked at the container server. We make the
approximating assumption that all valid states $\mathbf{n}$ (i.e. with $n_i > 0$ for all i) have
the same equilibrium probability. This property would hold were we to have
the job observer property in a decomposable, Markovian network with equally
utilised servers 1, ..., M, e.g. in the case that there was no blocking and we were
considering departure instants from the container server. With this assumption,
when the population of the servers 1, ..., M is K, i.e. when $\sum_{i=1}^{M} n_i = K$,
$\pi(\mathbf{n})$ is the reciprocal of the number of ways of putting K - M balls into M
bags, one ball having already been placed in each bag because every queue must
be non-empty. In other words, for all $\mathbf{n}$,

$$\pi(\mathbf{n}) = \pi = \frac{(K-M)!\,(M-1)!}{(K-1)!} \tag{4}$$
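The counting argument behind equation (4) can be checked by brute force. The sketch below is an illustration only: the function names and the values K = 7, M = 3 are assumptions chosen for the example, not part of the model. It enumerates every valid state and compares the count with the binomial coefficient C(K-1, M-1), whose reciprocal is the common state probability.

```python
from itertools import product
from math import comb, factorial


def valid_states(K, M):
    """All queue-length vectors n with every n_i >= 1 and sum(n) == K."""
    return [n for n in product(range(1, K + 1), repeat=M) if sum(n) == K]


def state_probability(K, M):
    """Common probability of each valid state: (K-M)! (M-1)! / (K-1)!."""
    return factorial(K - M) * factorial(M - 1) / factorial(K - 1)


if __name__ == "__main__":
    K, M = 7, 3  # illustrative values only
    states = valid_states(K, M)
    # The number of states, C(K-1, M-1), and the resulting probability.
    print(len(states), comb(K - 1, M - 1), state_probability(K, M))
```

The enumeration is exponential in M and is only meant for small sanity checks; the closed form is what the model uses.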



We now have the following result for the probability distribution of blocking
time.

Theorem 1 With the assumptions stated above, the probability distribution of
the blocking delay suffered by a blocked customer at equilibrium is

$$B(t) = 1 - \frac{(K-M)!}{(K-1)!}\, e^{-M\bar{\mu}t} \sum_{k=0}^{K-M} \frac{(K-k-1)!}{(K-M-k)!\,k!}\, (M\bar{\mu}t)^k \tag{5}$$

when there are K customers at the M parallel servers, where $\bar{\mu}$ is the average
service rate of the M parallel servers, i.e. $M\bar{\mu} = \sum_{i=1}^{M} \mu_i$.



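Proof When there are K customers in queues 1, ..., M,

$$\begin{aligned}
1 - B(t) &= \pi \sum_{\substack{\mathbf{n}:\ \forall i.\, n_i > 0 \\ \sum n_i = K}}\ \sum_{\mathbf{k}:\ \forall i.\, 0 \leq k_i \leq n_i - 1}\ \prod_{i=1}^{M} \frac{(\mu_i t)^{k_i}}{k_i!}\, e^{-\mu_i t} \\
&= \pi\, e^{-M\bar{\mu}t} \sum_{\substack{\mathbf{k}:\ \forall i.\, k_i \geq 0 \\ \sum k_i \leq K-M}}\ \sum_{\substack{\mathbf{n}:\ \forall i.\, n_i \geq k_i + 1 \\ \sum n_i = K}}\ \prod_{i=1}^{M} \frac{(\mu_i t)^{k_i}}{k_i!}
\end{aligned} \tag{6}$$

by changing the order of summation. The inner summand is independent of $\mathbf{n}$
and the number of terms is equal to the number of ways of putting K - M - k
balls into M bags, when $k = \sum_i k_i$. (The reasoning here is that at least $k_i + 1$
must be in bag i, using up k + M of the K balls in total.)

Consequently, we may write:

$$\begin{aligned}
1 - B(t) &= \frac{(K-M)!\,(M-1)!}{(K-1)!}\, e^{-M\bar{\mu}t} \sum_{\substack{\mathbf{k}:\ \forall i.\, k_i \geq 0 \\ \sum k_i \leq K-M}} \frac{(K-1-\sum k_i)!}{(M-1)!\,(K-M-\sum k_i)!} \prod_{i=1}^{M} \frac{(\mu_i t)^{k_i}}{k_i!} \\
&= \frac{(K-M)!}{(K-1)!}\, e^{-M\bar{\mu}t} \sum_{k=0}^{K-M} \frac{(K-k-1)!}{(K-M-k)!\,k!} \sum_{\substack{\mathbf{k}:\ \forall i.\, k_i \geq 0 \\ \sum k_i = k}} k! \prod_{i=1}^{M} \frac{(\mu_i t)^{k_i}}{k_i!} \\
&= \frac{(K-M)!}{(K-1)!}\, e^{-M\bar{\mu}t} \sum_{k=0}^{K-M} \frac{(K-k-1)!}{(K-M-k)!\,k!}\, (M\bar{\mu}t)^k
\end{aligned} \tag{7}$$

where the last step uses the multinomial theorem, and the result follows.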
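As a numerical sanity check of Theorem 1, the complementary distribution 1 - B(t) can be computed in two ways: directly, by averaging the product of Erlang tails over all equally likely states, and from the closed form. The sketch below is illustrative only; the function names and the test values (K = 6 and three deliberately unequal rates) are assumptions made for the example.

```python
from itertools import product
from math import exp, factorial


def erlang_tail(n, rate, t):
    """P(T > t) for an Erlang-n random variable with the given rate."""
    return exp(-rate * t) * sum((rate * t) ** k / factorial(k) for k in range(n))


def blocking_tail_direct(K, rates, t):
    """1 - B(t) as the average, over equally likely states, of products of Erlang tails."""
    M = len(rates)
    states = [n for n in product(range(1, K + 1), repeat=M) if sum(n) == K]
    total = 0.0
    for n in states:
        term = 1.0
        for n_i, mu_i in zip(n, rates):
            term *= erlang_tail(n_i, mu_i, t)
        total += term
    return total / len(states)


def blocking_tail_closed(K, rates, t):
    """1 - B(t) from the closed form, using M * mu_bar = sum(rates)."""
    M = len(rates)
    x = sum(rates) * t
    series = sum(
        factorial(K - k - 1) / (factorial(K - M - k) * factorial(k)) * x ** k
        for k in range(K - M + 1)
    )
    return factorial(K - M) / factorial(K - 1) * exp(-x) * series


if __name__ == "__main__":
    rates = [0.5, 1.0, 1.5]  # hypothetical service rates
    for t in (0.0, 0.5, 2.0):
        print(t, blocking_tail_direct(6, rates, t), blocking_tail_closed(6, rates, t))
```

The two columns printed for each t should agree up to floating-point rounding, including for unequal rates, since only the sum of the rates enters the closed form.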



We assume that all the M servers are equally likely to be chosen by each arrival
and that their service rates are all equal to $\mu$, which guarantees equal utilisations.
We then have

$$1 - B(t) = \frac{(K-M)!}{(K-1)!}\, e^{-M\mu t} \sum_{k=0}^{K-M} \frac{(K-k-1)!}{(K-M-k)!\,k!}\, (M\mu t)^k \tag{8}$$



Mean blocking time, b say, which we use in the equilibrium model for the queue
lengths for the whole system below, now follows straightforwardly.

Corollary 1 Mean blocking time for blocked customers at equilibrium is
$K/(M^2\bar{\mu})$ when there are K customers at the M parallel servers.



Proof

$$b = \int_0^\infty \bigl(1 - B(t)\bigr)\, dt = \frac{(K-M)!}{(K-1)!} \sum_{k=0}^{K-M} \frac{(K-k-1)!}{(K-M-k)!\,k!} \int_0^\infty (M\bar{\mu}t)^k\, e^{-M\bar{\mu}t}\, dt \tag{9}$$

Changing the integration variable from t to $s/M\bar{\mu}$ then yields

$$\begin{aligned}
b &= \frac{(K-M)!}{M\bar{\mu}\,(K-1)!} \sum_{k=0}^{K-M} \frac{(K-k-1)!}{(K-M-k)!\,k!} \int_0^\infty s^k e^{-s}\, ds \\
&= \frac{(K-M)!}{M\bar{\mu}\,(K-1)!} \sum_{k=0}^{K-M} \frac{(K-k-1)!}{(K-M-k)!} \\
&= \frac{(K-M)!}{M\bar{\mu}\,(K-1)!} \sum_{k=0}^{K-M} \frac{(k+M-1)!}{k!}
\end{aligned} \tag{10}$$

(by changing the summation variable to K - M - k).

We complete the proof by showing by induction on N that

$$\sum_{k=0}^{N} \frac{(k+m)!}{k!} = \frac{(N+m+1)!}{(m+1)\,N!} \tag{11}$$

for $N, m \geq 0$. For N = 0, both sides are m!. Now assume that

$$\sum_{k=0}^{n} \frac{(k+m)!}{k!} = \frac{(n+m+1)!}{(m+1)\,n!} \tag{12}$$

for $n \geq 0$. Then we have

$$\sum_{k=0}^{n+1} \frac{(k+m)!}{k!} = \frac{(n+m+1)!}{(m+1)\,n!} + \frac{(n+m+1)!}{(n+1)!} = \frac{(n+m+1)!}{(m+1)\,(n+1)!}\,(n+m+2) \tag{13}$$

as required. Substituting N = K - M and m = M - 1 now yields

$$b = \frac{(K-M)!}{M\bar{\mu}\,(K-1)!} \cdot \frac{K!}{M\,(K-M)!} = \frac{K}{M^2\bar{\mu}} \tag{14}$$

which completes the proof.
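Both steps of this proof are easy to confirm numerically. The sketch below is a minimal illustration (function names and test values are assumptions): it evaluates the finite sum obtained after the change of variable and compares it with K/(M^2 mu), and it checks the factorial identity proved by induction using exact integer arithmetic.

```python
from math import factorial


def mean_blocking_time(K, M, mu):
    """b = (K-M)! / (M mu (K-1)!) * sum_{k=0}^{K-M} (k+M-1)! / k!."""
    series = sum(factorial(k + M - 1) // factorial(k) for k in range(K - M + 1))
    return factorial(K - M) * series / (M * mu * factorial(K - 1))


def identity_holds(N, m):
    """Check sum_{k=0}^{N} (k+m)!/k! == (N+m+1)! / ((m+1) N!) exactly."""
    lhs = sum(factorial(k + m) // factorial(k) for k in range(N + 1))
    rhs = factorial(N + m + 1) // ((m + 1) * factorial(N))
    return lhs == rhs


if __name__ == "__main__":
    K, M, mu = 10, 4, 0.25  # illustrative values only
    print(mean_blocking_time(K, M, mu), K / (M ** 2 * mu))
    print(all(identity_holds(N, m) for N in range(8) for m in range(8)))
```

Integer division is exact here because each quotient of factorials is an integer.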



4.2 The FES and Whole Model 

The complete (sub)model of a bean container and its M bean method execution 
servers at constant population N is shown in Fig. 8 and detailed below.




Fig. 8. Complete Model 



The service rate functions (with blocking) $\mu_1(j)$ and $\mu_2(k)$, where k = N - j
(for the FES), are defined as follows:

$$\mu_1(j) = \begin{cases} \ldots & \text{if } N - j \geq M \\ \ldots & \text{otherwise} \end{cases} \tag{15}$$

where j is the number of clients in the outer (container) server and $m_1$ is mean
service time for server 1 (the outer server) when there is no blocking. The parameter
$\beta_{N-j}$ is the blocking probability, which we derive in the next subsection.

$$\mu_2(k) = \begin{cases} k\mu & \text{if } k < M \\ z_k (M-1)\mu + (1 - z_k) M\mu & \text{if } I > M,\ k \geq M \\ M\mu & \text{otherwise} \end{cases} \tag{16}$$

where k is the number of clients at the M parallel servers and $\mu$ is the service
rate of each of them. The parameter $z_k$ is the probability that there is at least
one idle server; it is also derived in the next subsection.

Clearly the visitation rate is the same for both servers. The steady state 
probability distribution for this network, p(j) for the state with j tasks at
server 1 and N - j at server 2, is then calculated as a product form in standard
fashion (see, for example, 0). Throughput T is then given by 

$$T = \sum_{j=1}^{N} p(j)\,\mu_1(j). \tag{17}$$
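For a closed two-station cyclic network, the product form reduces to a birth-death recursion in j, since local balance gives p(j) mu_2(N - j) = p(j + 1) mu_1(j + 1). The sketch below shows this computation for arbitrary rate functions. It is an illustration under stated assumptions: the function names are hypothetical, and the rate functions in the usage example are simple placeholders without blocking (a constant rate 1/m_1 for server 1 and min(k, M) mu for the parallel servers), not the blocking-aware functions of equations (15) and (16). The parameter values are those quoted in Section 5.

```python
def stationary_distribution(N, mu1, mu2):
    """Return [p(0), ..., p(N)], with j tasks at server 1 and N - j at server 2."""
    weights = [1.0]
    for j in range(N):
        # Local balance: p(j) * mu2(N - j) = p(j + 1) * mu1(j + 1)
        weights.append(weights[-1] * mu2(N - j) / mu1(j + 1))
    total = sum(weights)
    return [w / total for w in weights]


def throughput(N, mu1, mu2):
    """T = sum_{j=1}^{N} p(j) * mu1(j)."""
    p = stationary_distribution(N, mu1, mu2)
    return sum(p[j] * mu1(j) for j in range(1, N + 1))


if __name__ == "__main__":
    N, M = 30, 6
    m1, mu = 0.4, 1 / 4.1
    # Placeholder rate functions (assumptions, no blocking modelled).
    print(throughput(N, lambda j: 1 / m1, lambda k: min(k, M) * mu))
```

Any pair of positive rate functions can be passed in, so the blocking-aware functions can be substituted once their parameters are available.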



4.3 Instance-Active and Client-Blocking Probabilities

Let $\alpha$ denote the probability that the instance required by a task arriving at
the M parallel servers is active, i.e. that the task can immediately join that
instance’s queue (whether or not empty) and so is not blocked. We approximate
$\alpha$ by M/I. Let $z_k$ denote the equilibrium probability that, when there are k
tasks at the M servers altogether, at least one of the servers is idle. Then the blocking
probability is $\beta_k = (1 - z_k)(1 - \alpha)$.

The parameter $z_k$ can be estimated by considering a simple two dimensional
Markov chain $(K_t, Z_t)$ where, at time t, $K_t$ represents the number of tasks in
total at the M parallel servers and $Z_t = 0$ if there is at least one empty queue
amongst the parallel servers, $Z_t = 1$ if not. A more precise model would replace
$Z_t$ by $E_t$, the number of empty queues, i.e. idle servers, out of the M. However,
we use an even simpler submodel to estimate $z_k$: a two state Markov chain $Z_t$
for each population size k at the parallel servers. Clearly this is an approximate
decomposition since $K_t$ and $Z_t$ are not independent, but it does capture the
essence of the blocking that is present. This results in a relatively low probability
of empty queues (when k > M) since it will often be the case that a task is
blocked when a server with only one task completes a service period, resulting
in an empty queue which is instantaneously occupied by the unblocked task; the
empty state is therefore never seen.

For population size k at the parallel servers, let the equilibrium probability
that $Z = Z_\infty = 1$ (respectively Z = 0) be denoted by $\pi_k(1)$ (respectively $\pi_k(0)$).
Obviously, for k < M, there is always an empty queue and so $\pi_k(0) = 1$ and
$\pi_k(1) = 0$. For $N \geq k \geq M$, the balance equations for the submodel are then

$$\pi_k(0)\, m_1^{-1}\, \frac{I - M + 1}{I}\, u = \pi_k(1)\, m\mu\, \frac{m_1 M^2 \mu}{m_1 M^2 \mu + k(1 - \alpha)} \tag{18}$$

where, at population k, u is the probability that there is exactly one empty
queue, given that Z = 0, and m is the average number of singleton queues (i.e. with
exactly one task) given that Z = 1 (no empty queues). The fraction on the left
hand side represents the probability of a task not choosing an instance at one of
the M - 1 busy servers. The fraction on the right hand side is the probability
of observing the task at the front of the container queue receiving service (with
mean time $m_1$) rather than in a blocked state (with mean time $(1 - \alpha)\,k/(M^2\mu)$),
so that a queue emptied by a departure is not immediately reoccupied.

Now, by symmetry, u is the probability that queue number 1 is the only
empty queue, given that it is empty, and so may be estimated as the ratio of the
number of arrangements of k tasks in M - 1 queues with at least 1 task in each,
to the number of arrangements of k tasks in M - 1 queues. This can be written

$$u = \frac{k!\,(k-1)!}{(k-M+1)!\,(k+M-2)!}. \tag{19}$$

Notice that if M = 2, u = 1 and if M = 3, u = (k - 1)/(k + 1) as required.

Next, m = Mv where v is the probability that queue number 1 (for example)
has length exactly one, given that its length is at least one. Thus we estimate
v as the ratio of the number of arrangements of k - M tasks in M - 1 queues,
to the number of arrangements of k - M tasks in M queues; after assigning one
task to each queue, there are only k - M left. Hence, $v = (M-1)/(k-1)$ and
$m = M(M-1)/(k-1)$ for $k \geq M$.

We now estimate $z_k$ by $\pi_k(0)$ and $\beta_k$ follows for $M \leq k \leq N$.
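The two combinatorial ratios above can be verified by direct counting. In the sketch below (an illustration only; all function names are assumptions), `count_arrangements` enumerates the ways of distributing tasks among queues, and the counted ratios are compared with the closed forms u = k!(k-1)!/((k-M+1)!(k+M-2)!) and v = (M-1)/(k-1), the latter obtained by simplifying the ratio of binomial coefficients C(k-2, M-2)/C(k-1, M-1).

```python
from itertools import product
from math import factorial


def count_arrangements(tasks, queues, minimum):
    """Ways to split `tasks` among `queues` ordered queues, each holding >= `minimum`."""
    return sum(
        1
        for c in product(range(minimum, tasks + 1), repeat=queues)
        if sum(c) == tasks
    )


def u_closed(k, M):
    return factorial(k) * factorial(k - 1) / (factorial(k - M + 1) * factorial(k + M - 2))


def v_closed(k, M):
    return (M - 1) / (k - 1)


def u_counted(k, M):
    """k tasks in M - 1 queues: at least one each, relative to unrestricted."""
    return count_arrangements(k, M - 1, 1) / count_arrangements(k, M - 1, 0)


def v_counted(k, M):
    """k - M spare tasks in M - 1 queues, relative to k - M spare tasks in M queues."""
    return count_arrangements(k - M, M - 1, 0) / count_arrangements(k - M, M, 0)


if __name__ == "__main__":
    for M in (2, 3, 4):
        for k in range(M, M + 4):
            print(M, k, u_closed(k, M), u_counted(k, M), v_closed(k, M), v_counted(k, M))
```

The mean number of singleton queues then follows as m = M * v_closed(k, M).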



5 Results 

Since it is only possible to analytically model complex distributed systems using
several simplifications and assumptions, approximation techniques have to be
(and have been) used. Consequently, theoretical results need to be validated
by comparing them with those obtained from simulation. A simulation program
was written using QNAP2 V. 9.3, a modelling language developed for INRIA
(Institut National de Recherche en Informatique et en Automatique). Among other
facilities, this language provides a discrete-event simulator.

The simulations were run for 100,000 time units, using the batch means
method, and the resulting confidence intervals had an error of less than 5% at
a 95% level of confidence. The graphical representation of the simulation and
analytical results for different values of N (number of threads), I (number of
bean instances) and M (number of parallel servers) is shown in Fig. 9, Fig. 10 and
Fig. 11 respectively. The parameter r relating to sequences of calls to the same
instance was set to 0.8, $m_1 = 0.4$ and $\mu = 1/4.1$.
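For readers unfamiliar with the batch means method, the following sketch shows how a confidence interval of this kind is typically obtained from a single long run. It is a generic illustration, not the QNAP2 procedure: the function names, the choice of 20 batches and the interpretation of "error" as relative half-width are all assumptions. The default t_crit = 2.093 is the Student-t quantile for a 95% interval with 19 degrees of freedom.

```python
from statistics import mean, stdev


def batch_means_interval(samples, n_batches=20, t_crit=2.093):
    """Return (point estimate, half-width) from equal-sized batch means.

    t_crit must match n_batches - 1 degrees of freedom and the desired
    confidence level; the default suits 20 batches at the 95% level.
    """
    size = len(samples) // n_batches
    means = [mean(samples[b * size:(b + 1) * size]) for b in range(n_batches)]
    centre = mean(means)
    half_width = t_crit * stdev(means) / n_batches ** 0.5
    return centre, half_width


def relative_error(centre, half_width):
    """Half-width of the interval as a fraction of the point estimate."""
    return half_width / abs(centre)


if __name__ == "__main__":
    import random

    random.seed(1)
    # Hypothetical observations standing in for simulated throughput samples.
    data = [random.expovariate(1.0) for _ in range(20000)]
    centre, half = batch_means_interval(data)
    print(centre, half, relative_error(centre, half))
```

Grouping observations into batches reduces the correlation between successive data points, which is why the batch means, rather than the raw samples, are treated as approximately independent.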

Fig. 9. Throughput comparison for simulated and analytical results for M=6,
I=20 and N:0-40

Fig. 10. Throughput comparison for simulated and analytical results for M=6,
N=30 and I:1-40

Fig. 11. Throughput comparison for simulated and analytical results for N=30,
I=40 and M:3-24

6 Conclusion

The Kensington Enterprise Data Mining system is a real example of a distributed
client-server system. As such, its complexity makes its analytical modelling
non-trivial. The approach we have followed is to study isolated functionalities of the
system and then combine them in a hierarchical methodology. We started by
analysing the behaviour of a method execution call and divided the network
into two sub-networks (FES and complement sub-networks). In this way, we
were able to isolate the blocking that occurs when an instance is inactive and
all servers are busy, yielding an approximate two-server model.

Numerical results showed generally good agreement with simulation, especially
at large M and I, where contention is lower. However, even at high contention
the error was less than 5%, suggesting a good building block for the
modelling of complex, distributed, object-based scheduling systems.

Future work will focus on studying other functionalities of the system, where
some of the results and techniques used in our method execution analysis will be
re-applied. In particular, we will focus our attention on a more detailed model
of the outer layers of Fig. 6, in which a more conventional form of blocking
after service exists. There are also certain improvements that can be made in
the accuracy of some of our model’s parameters. In particular, a more detailed
Markov model is being developed to determine $z_k$ and hence the service rate
function of the FES.

Abstract. The paper proposes a method to optimize the resource utilization
of a GSM mobile network which takes into account subscriber
numbers, subscriber profiles and mobility patterns of subscribers. First 
the whole problem is modeled analytically as a system of equations. The 
system is solved by splitting it into linear subproblems, which are solved 
successively or iteratively by LP programs. Finally the integration of var- 
ious additional requirements on the network design into the given model 
is discussed. 



1 Introduction 

The first GSM mobile networks came into operation in the early nineties. Since
then there has been a tremendous increase in subscribers and a continuous expansion
of the network infrastructure.

Except for smaller nodes for subscriber and equipment administration, the switching
part of the network mainly consists of the mobile switching centers (MSCs).
The radio part consists of the base station transceivers (BTSs) serving several
radio cells and the base station controllers (BSCs). These three network elements
form a hierarchical structure: the BSCs are connected directly to the MSCs
whereas the BTSs are connected directly, in multidrop or in a loop to the BSCs.
Accordingly the BTSs are clustered in BSC regions, where by a BSC region we
understand all the BTSs connected to the same BSC, and in MSC regions consisting
of all the BTSs connected via BSCs to the same MSC. The mobility
management in a mobile network furthermore requires the definition of location 
areas. A location area is a connected subset of an MSC region or corresponds to 
a whole MSC region. If a subscriber a wants to communicate with a subscriber b 
known to be in a certain location area, a signal is simultaneously broadcast in all 
cells of the location area in order to find out in which cell he actually is. Large 
location areas produce a high load on the broadcasting channel because many 
subscribers might have to be traced at the same time. On the other hand the 
design of many location areas in an MSC region requires more resources at the 
MSC as will be explained in section 2. A BSC region normally belongs
to one location area (for a more detailed introduction to GSM see [2], [7]). An
example of a network topology is given in figure 1. The fixed network planner
gets the locations of the BTSs, of the MSCs and their technical configurations
as an input. His task is to decide on the number and location of the BSCs, the
way the BTSs are connected to the MSCs via the BSCs and the design of the
location areas. His design will be characterized by technical feasibility, optimal
equipment utilization and cost minimization. During the evolution of the network
he will continuously reconsider and, with a decreasing degree of freedom,
eventually change the network structure.

Fig. 1. Example of an MSC region with two location areas and three BSC
regions.

In this paper we focus on an optimal use of the equipment. More precisely we 
will present a mathematical method for optimizing the use of the resources of the 
MSCs in an already operating GSM network by restructuring certain elements 
of the network topology. 

2 Exact Description of the Problem 

In this section we first explain the functionalities of an MSC in a mobile network 
which are relevant to our problem. Then we illustrate the optimization task. 
The MSC has two main functionalities: 

1) Call processing. 

2) Mobility management. 

By call processing we mean the handling of any voice or data communication. 
The resources required at the MSC for call processing depend on the kind of 
communication (data, fax, short message service etc.) that takes place and on
the position of the two subscribers relative to the network topology. One has 
to distinguish for example between calls coming from or going into an external 
network and calls remaining in the same network. For the calls that remain in 
the same network the resources required at the MSC depend on whether or not 
both subscribers belong to the same BSC and/or MSC region. 

By mobility management we understand all transactions in the system which 
are caused by the tracing and registration of the actual cell of the network the 
individual subscriber is in. Basically two scenarios have to be considered here: 
Handover and location update. 







C. Bauer 




Remark: In practice the MSC region borders are not straight lines as shown above, but 
follow the often hexagonal structure of the radio cells. 



By a handover we mean the change of a cell by a subscriber in dedicated mode
without losing the connection, achieved by allocating resources in the new cell. The
resource requirements at the MSC mainly depend on the position of the cell he
leaves and the cell he moves into relative to the network topology. We differentiate
between three cases: both cells belong to the same BSC region, but not
to the same BTS; they belong to the same MSC region, but not to the same
BSC region; or they do not belong to the same MSC region. (We do not have to
consider handovers between cells belonging to the same BTS.)

A location update is initiated when a subscriber moves from one location area
to another. Concerning the required resources of the MSCs we have to distinguish
between two scenarios: the old and the new location area either belong to the
same MSC region or they do not.
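The case distinctions above amount to comparing the positions of the old and new cells in the BTS, BSC, MSC hierarchy. A minimal sketch of that classification follows; the `Cell` record, the function names and the case labels are hypothetical names introduced for illustration and do not come from the paper or from any GSM specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Position of a radio cell in the network hierarchy (illustrative model)."""
    bts: str
    bsc: str
    msc: str
    location_area: str


def classify_handover(old: Cell, new: Cell) -> Optional[str]:
    """Return the handover case relevant to MSC resources, or None if not considered."""
    if old.bts == new.bts:
        return None  # same BTS: not considered
    if old.bsc == new.bsc:
        return "same BSC region"
    if old.msc == new.msc:
        return "same MSC region, different BSC regions"
    return "different MSC regions"


def classify_location_update(old: Cell, new: Cell) -> Optional[str]:
    """Return the location update case, or None if the location area is unchanged."""
    if old.location_area == new.location_area:
        return None
    if old.msc == new.msc:
        return "same MSC region"
    return "different MSC regions"


if __name__ == "__main__":
    a = Cell("bts1", "bsc1", "mscA", "la1")
    b = Cell("bts7", "bsc4", "mscC", "la5")
    print(classify_handover(a, b), "/", classify_location_update(a, b))
```

Each returned case would be weighted by its own resource cost at the MSC, which, as noted in the text that follows, depends on the manufacturer.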

The exact way call processing and mobility management have to be consid- 
ered in order to get a precise estimate for the resources they require at different 
components of the MSC might be very different for products from different man- 
ufacturers. 

In order to estimate the amount of traffic that will be generated in a region to 
be planned the network planner needs not only to know the number of expected 
subscribers, but also to have a traffic model characterizing the different classes 
of subscribers. This must include the call attempts, mean holding time, activity
ratio, percentage of calls coming from within and without the network etc. per
subscriber per class, but also information related to his mobility behavior.

In the network shown in figure 2, there are three adjacent MSC regions A, B
and C with a highway passing through regions A and C. We suppose that the 
average subscriber in all the three regions is characterized by the same traffic 
model which in this case does not contain information on mobility behavior. 
Moreover we assume that the numbers of subscribers in all three regions are 
approximately equal. Suppose an analysis of the required resources at the three
MSCs A, B and C shows that MSC B is running at 60 % of its capacity,
whereas MSCs A and C are running at 80 %. In the present
situation the three MSCs can handle call processing and mobility management
without difficulty. If for the future we assume the traffic model to remain the
same and further assume an equal subscriber growth in all the three regions, the
MSCs A and C will reach their capacity limit at a number of subscribers where
MSC B could still handle more subscribers. In order to prevent this, a criterion
for a network design must be to reach a low, similar load on all the three MSCs.
Due to the equality assumptions regarding subscriber numbers and traffic models 
in the different MSC regions the unequal resource requirements can only be due 
to unequal resources required by the mobility management in the different MSC 
regions. In the example subscribers might actively use their handset while driving 
on the highway from MSC region A to C and then back to A. Every time they 
cross the border between the MSC regions A and C they require resources at 
the MSCs A and C due to handovers/location updates. Let us assume a deeper 
analysis of the network shows this scenario happens frequently and leads to the 
unequal MSC resource requirements. A way to avoid this could be to change the
border between the regions A and C in such a way that the highway runs entirely
in region A. This action alone would enlarge MSC region A and so increase the
resource requirements at MSC A due to call processing. Therefore one shifts the
borders between the regions A and B and the regions B and C in such a way that
the expected subscriber numbers in the three regions are approximately equal
(see figure 3).

Fig. 3. Modified Network
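As a worked illustration of the capacity argument in this example, assume (a simplification not stated in the text) that the load on each MSC scales linearly with a common subscriber growth factor 1 + g. Writing $\rho_k$ for the current utilization of MSC k (a symbol introduced here for illustration), MSC k saturates when $(1 + g)\rho_k = 1$, so that

$$g_k = \frac{1}{\rho_k} - 1, \qquad g_A = g_C = \frac{1}{0.8} - 1 = 0.25, \qquad g_B = \frac{1}{0.6} - 1 \approx 0.67 .$$

Under this assumption the network as a whole can absorb only $\min(g_A, g_B, g_C) = 25\,\%$ growth, although MSC B alone could absorb about 67 %. This unused headroom is what a redesign with a similar load on all three MSCs aims to recover.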

In similar scenarios it might be very difficult to find the reasons for the unequal 
resource requirements. But even if the reasons are known, it is difficult to decide 
on a redesign of the network that solves the existing problems without causing 
other problems. Theoretically the network planner can assign any BTS to any 
of the three MSC regions and then redefine the location area and BSC regions, 
which again can be done in numerous ways. (In practice there is the restriction
that the cells of the BTSs belonging to one BSC region, MSC region or location area
must form a connected region and not two or more disconnected regions.) The
huge number of redesign possibilities of the network, which grows exponentially
with the number of network elements, shows that it is impossible for the planner
to solve the problem of low and approximately equal resource requirements at
the three MSCs A, B and C optimally.

The scope of this paper is to give an analytical expression for the general problem 
of using the MSC resources in a mobile network in such a way that for a given 
traffic model for an average subscriber and given locations of the BTS, BSC and 
MSC a maximum number of subscribers can be served by the mobile network. 
Based on this we will describe a mathematical procedure for designing the MSC 
regions, the BSC regions and the location areas such that a maximal increase 
of subscribers can be handled by the network. We will not allow for the change 
of BTS, BSC or MSC locations or the introduction of new network elements in 
our procedure. 

Several aspects of the problem treated here have been investigated in the literature. 
The problem of an optimal design of location areas was considered in [2] as 
well as in [5], [6] and [8] where it is approached by using integer programming 
models. In [3] a greedy heuristic algorithm is applied to the problem. The impact 
of mobility management on the MSCs was considered in [9] where a list of 
important parameters is provided. The present paper distinguishes itself from 
these contributions by giving a model and a solution for the complete problem. 



3 Mathematical Model of the Problem 



We formulate the problem in the following way: 

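Choose the variables $x_{k,i}$, $y_{m,i}$, $z_{o,i}$ such that $g$ may be chosen maximally large 
and (1) - (11) hold: 

$$\sum_{i=1}^{N} a_{k,i}(S(1+g))\,x_{k,i} + \sum_{\substack{l=1\\ l\ne k}}^{M}\sum_{i=1}^{N}\sum_{j\in N(i)} A_{k,l,i,j}(S(1+g))\,X_{k,l,i,j} + \sum_{m\in M(k)}\sum_{n\in M(k)}\sum_{i=1}^{N}\sum_{j\in N(i)} B_{m,n,i,j}(S(1+g))\,Y_{m,n,i,j} + \sum_{m\in M(k)}\sum_{o\in L(m)}\sum_{\substack{p\in L(m)\\ o<p}}\sum_{i=1}^{N}\sum_{j\in N(i)} C_{o,p,i,j}(S(1+g))\,Z_{o,p,i,j} \le L_k \quad \forall k\in\{1,\dots,M\}. \qquad (1)$$

$$\forall k,l\in\{1,\dots,M\},\ k\ne l,\quad \forall m,n\in\{1,\dots,L\},\ m\ne n,\quad \forall o,p\in\{1,\dots,B\},\ o\ne p,\quad \forall i\in\{1,\dots,N\},\quad \forall j\in N(i).$$

$$x_{k,i} = \begin{cases} 1, & \text{if BTS } i \text{ is in MSC region } k,\\ 0, & \text{otherwise.} \end{cases}$$

$$X_{k,l,i,j} = \begin{cases} 1, & \text{if BTS } i \text{ is in MSC region } k \text{ and BTS } j \text{ in MSC region } l,\\ 0, & \text{otherwise.} \end{cases}$$

$$y_{m,i} = \begin{cases} 1, & \text{if BTS } i \text{ is in location area } m,\\ 0, & \text{otherwise.} \end{cases}$$

$$Y_{m,n,i,j} = \begin{cases} 1, & \text{if BTS } i \text{ is in location area } m \text{ and BTS } j \text{ in location area } n,\\ 0, & \text{otherwise.} \end{cases}$$

$$z_{o,i} = \begin{cases} 1, & \text{if BTS } i \text{ is in BSC region } o,\\ 0, & \text{otherwise.} \end{cases}$$

$$Z_{o,p,i,j} = \begin{cases} 1, & \text{if BTS } i \text{ is in BSC region } o \text{ and BTS } j \text{ in BSC region } p,\\ 0, & \text{otherwise.} \end{cases}$$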

The following three variables are all functions of the total number of subscribers 
in the whole network, which is here expressed by $u$: 

$A_{k,l,i,j}(u)$ = resources required at MSC $k$ if $X_{k,l,i,j} = 1$. 

$B_{m,n,i,j}(u)$ = resources required at MSC $k$ if $Y_{m,n,i,j} = 1$ and $m, n \in M(k)$. 

$C_{o,p,i,j}(u)$ = resources required at MSC $k$ if $Z_{o,p,i,j} = 1$, $o, p \in L(m)$ 
and $m \in M(k)$ for any $m \in \{1,\dots,L\}$. 

The above definitions hold for the ranges of $k$, $l$, $m$, $n$, $o$, $p$ given after (11). 

$S$ = total number of subscribers in the whole network, 
$g$ = subscriber growth rate, 
$j \in N(i)$: BTS $i$ adjacent to BTS $j$, 
$m \in M(k)$: location area $m$ lies in MSC region $k$, 
$o \in L(m)$: BSC region $o$ lies in location area $m$, 
$M$ = number of MSC regions, 
$L$ = total number of location areas in all MSC regions, 
$B$ = total number of BSC regions in all MSC regions, 
$N$ = number of all BTSs, 
$L_k$ = maximum available resources of MSC $k$. 

To explain the model we first describe the functions of the variables $x_{k,i}$, $y_{m,i}$, 
$z_{o,i}$, $X_{k,l,i,j}$, $Y_{m,n,i,j}$ and $Z_{o,p,i,j}$ together with the equations (2) - (11) and then 
explain equation (1). 

The variables $x_{k,i}$, $y_{m,i}$ and $z_{o,i}$ reflect the network structure. Each BTS is 
assigned to an MSC region, a location area and a BSC region. The way the BTSs 
are connected to the BSCs is not specified. 

The variable $x_{k,i}$ assigns a BTS $i$ to an MSC region $k$, where (2) ensures that $i$ 
is assigned to exactly one MSC region $k$. $y_{m,i}$ assigns the BTS $i$ to a location 
area $m$. Equation (3) guarantees that $i$ is assigned to exactly one $m$, where $m$ 
belongs to the MSC region $k$ the BTS $i$ is assigned to in (2). Analogously $z_{o,i}$ 
together with equation (4) ensures that the BTS $i$ is assigned to a BSC region 
$o$, where $o$ belongs to the location area $m$ chosen in (3). 

The variables $X_{k,l,i,j}$, $Y_{m,n,i,j}$ and $Z_{o,p,i,j}$ contain information on mobility 
scenarios that influence the resources required at the MSCs. 

The variable $X_{k,l,i,j}$ takes the value 1 if the neighboring BTSs $i$ and $j$ belong 
to the MSC regions $k$ and $l$ respectively ($k \ne l$) and 0 otherwise. This is 
expressed by the equations (5) and (6), which show that $X_{k,l,i,j}$ is equal to 1 only if 
$x_{k,i} = x_{l,j} = 1$ and equal to 0 otherwise. $Y_{m,n,i,j}$ is equal to 1 if the neighboring 
BTSs $i$ and $j$ belong to the location areas $m$ and $n$ respectively ($m \ne n$) and 
equal to 0 otherwise. It is defined by the equations (7) - (8). $Z_{o,p,i,j}$ contains the 
analogous information for the non-identical BSC areas $o$ and $p$ and is defined by 
the equations (9) and (10). 

The right side of equation (1) gives the maximum available resources at MSC 
$k$. The left side of equation (1) equals the required resources at MSC $k$ at a 
time $t_1$, when the total number of subscribers has grown from $S$ to $S(1+g)$. 
The variable $S$ defines the total number of subscribers in the whole network at 
a time $t_0$ and $g > 0$ is the subscriber growth rate in the network from time $t_0$ 
until a time $t_1 > t_0$. 

The first sum on the left side of equation (1) expresses the resources required at 
MSC $k$ by call processing. If and only if BTS $i$ belongs to MSC region $k$, then 
$x_{k,i} = 1$ and BTS $i$ causes a load of $a_{k,i}(S(1+g))$ at MSC $k$. $a_{k,i}(x)$ is a function 
of one variable $x$ and two functions $f_i(x)$ and $g_i$: $x$ is the number of subscribers 
in the whole network, $f_i(x)$ is the average number of subscribers served by BTS 
$i$ if the total number of subscribers is equal to $x$, and $g_i$ is a traffic model per 
average subscriber at BTS $i$. As we suppose the distribution functions $f_i$ and 
the traffic models not to change, we omit them in our notation of $a_{k,i}(x)$. As a 
function of $x$, $a_{k,i}(x)$ is monotonically increasing. 

The second sum expresses the resources required at MSC $k$ by handovers and 
location updates between radio cells inside and outside MSC region $k$. As 
expressed by the definition of $X_{k,l,i,j}$, such a scenario can only take place between 
a BTS $i$ belonging to MSC region $k$ and a neighboring BTS $j$ belonging to 
an MSC region $l$. $A_{k,l,i,j}(x)$ is the respective load caused by BTS $i$ at MSC $k$. 
Again $A_{k,l,i,j}(x)$ is a function of the total population $x$, of $f_i(x)$ and $g_i$, and it is 
monotonically increasing in $x$. 

The third sum equals the resources required at MSC k by location updates and 
handovers between location areas in the MSC region k and is to be understood 
in the same way as the second sum. Analogously the fourth sum contains the 
resources required by handovers between BSC regions belonging to the same 
location areas. 

The variables $x_{k,i}$, $y_{m,i}$, $z_{o,i}$ have to be chosen in a way such that $g$ can be 
chosen maximally large under the conditions imposed by (1). Equation (11) ensures 
that we either obtain a positive subscriber growth rate or the system has no solution, 
i.e. the maximum number of subscribers that can be served by the network has 
already been surpassed at time $t_0$. 

Remark : We do not give any measure for the required resources of the MSC 
in order to keep the model as general as possible. In practice one might not 
calculate general MSC resources, but instead calculate the resources required at 
one or several components of the MSCs, which are most critical in the described 
application. If we consider several components, equation (1) might have to be 
substituted by a corresponding set of equations, one for each component. 

Until now we have supposed that the planner predefines a hierarchy between 
MSC regions and location areas via the sets $M(k)$ and between location areas 
and BSC regions via the sets $L(m)$. This assumption can be weakened as follows: 
We again determine a hierarchy between MSC regions and location areas via the 
sets $M(k)$, but instead of assigning BSC regions to a fixed location area via the 
sets $L(m)$, we define the following sets: 

$H(m)$ = set of all BSC regions that can be in location area $m$, 

$G(o)$ = set of all location areas that can contain BSC region $o$, 

where $o \in H(m) \Leftrightarrow m \in G(o)$ and both sets are required not to be empty 
$\forall m \in \{1,\dots,L\}$, $\forall o \in \{1,\dots,B\}$. By defining the sets $H(m)$ and $G(o)$, different 
possibilities of assigning the BSC regions to the location areas are left open and 
the actual choice of an assignment is determined by the maximization of $g$. The 
number of location areas and BSC regions is fixed as in the original model. We 
define new variables 

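$$W_{m,o,p,i,j} = \begin{cases} 1, & \text{if BTS } i \text{ belongs to BSC region } o, \text{ BTS } j \text{ to BSC region } p \text{ and BTSs } i \text{ and } j \text{ to location area } m,\\ 0, & \text{otherwise.} \end{cases}$$

Using these variables we define the relations: 

$$10000\,(2 - z_{o,i} - z_{o,j}) \ge \Bigl|\sum_{m\in G(o)} m\,(y_{m,i} - y_{m,j})\Bigr|, \qquad (12)$$

$$\sum_{o=1}^{B} z_{o,i} = 1, \qquad (13)$$

$$\sum_{o\in H(m)} z_{o,i} \ge y_{m,i}, \qquad (14)$$

$$W_{m,o,p,i,j} \ge z_{o,i} + z_{p,j} + y_{m,i} + y_{m,j} - 3, \qquad (16)$$

$$\forall m\in\{1,\dots,L\},\quad \forall o,p\in\{1,\dots,B\},\ o\ne p,\quad \forall i\in\{1,\dots,N\},\quad \forall j\in N(i).$$

We now change our model by substituting equation (4) by (12) - (14), (9) by 
(15) and (10) by (16). Then we substitute the equations (1) by 

$$\sum_{i=1}^{N} a_{k,i}(S(1+g))\,x_{k,i} + \sum_{\substack{l=1\\ l\ne k}}^{M}\sum_{i=1}^{N}\sum_{j\in N(i)} A_{k,l,i,j}(S(1+g))\,X_{k,l,i,j} + \sum_{m\in M(k)}\sum_{n\in M(k)}\sum_{i=1}^{N}\sum_{j\in N(i)} B_{m,n,i,j}(S(1+g))\,Y_{m,n,i,j} + \sum_{m\in M(k)}\sum_{o\in H(m)}\sum_{\substack{p\in H(m)\\ o<p}}\sum_{i=1}^{N}\sum_{j\in N(i)} C_{o,p,i,j}(S(1+g))\,W_{m,o,p,i,j} \le L_k \quad \forall k\in\{1,\dots,M\}. \qquad (17)$$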

Equation (13) ensures that each BTS $i$ is assigned to exactly one BSC. The 
inequality (14) together with (2) and (3) furthermore assigns the BTS $i$ to a 
BSC region out of the unique set $H(m)$ with $y_{m,i} = 1$. Finally equation (12) 
ensures that two BTSs which are assigned to the same BSC also belong to the 
same location area, i.e. a BSC region always belongs completely to one location 
area. Due to the large factor 10000, equation (12) is without significance if $z_{o,i}$ 
and $z_{o,j}$ are not both equal to 1. Otherwise the right hand side of (12) must 
vanish. Using again that $y_{m,i} = 1$ for exactly one $m$, $\forall i \in \{1,\dots,N\}$, and the fact that 
in (12) we sum over pairwise different positive integers $m \in G(o)$, we obtain 
$y_{m,i} = y_{m,j} = 0$ $\forall i \in \{1,\dots,N\}$, $\forall j \in N(i)$ for all but one $m \in G(o)$. For the 
remaining value $m$, $y_{m,i} = y_{m,j} = 1$ holds and therefore the BTSs $i$ and $j$ belong 
to the same location area. 
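
The argument for (12) can be checked by enumeration. The following is a minimal sketch, not part of the original model description: it assumes that each BTS lies in exactly one location area and that this area is an element of $G(o)$. The function name `constraint_12_holds` and the set `G_o` are hypothetical and chosen only for illustration.

```python
from itertools import product

BIG = 10000  # the large factor used in (12)


def constraint_12_holds(z_oi, z_oj, area_i, area_j, G_o):
    """Check 10000*(2 - z_oi - z_oj) >= |sum_{m in G_o} m*(y_mi - y_mj)|.

    area_i and area_j are the single location areas of BTS i and BTS j,
    so y_{m,i} = 1 exactly when m == area_i (and likewise for j).
    """
    weighted = sum(m * ((m == area_i) - (m == area_j)) for m in G_o)
    return BIG * (2 - z_oi - z_oj) >= abs(weighted)


# Hypothetical set of location areas that may contain BSC region o.
G_o = [1, 2, 3]

for z_oi, z_oj, area_i, area_j in product([0, 1], [0, 1], G_o, G_o):
    ok = constraint_12_holds(z_oi, z_oj, area_i, area_j, G_o)
    both_in_o = z_oi == 1 and z_oj == 1
    # The constraint only bites when both BTSs sit in BSC region o;
    # in that case it holds exactly when they share a location area.
    assert ok == ((not both_in_o) or area_i == area_j)
```

The check relies on the location area indices being pairwise different integers, which is the property used in the text; the factor 10000 merely has to exceed the largest possible index difference.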

The equations (17) differ from (1) by the $W_{m,o,p,i,j}$, which replace the $Z_{o,p,i,j}$. 
They take the value 1 if and only if the BTSs $i$ and $j$ belong to the BSCs 
$o$ and $p$ respectively and if both belong to the location area $m$. This 
corresponds to the relations (15) and (16) because $W_{m,o,p,i,j} = 1$ if and only if 
$z_{o,i} = z_{p,j} = y_{m,i} = y_{m,j} = 1$ and equal to 0 otherwise. In (17) we sum over 
the potentially larger set $H(m)$ instead of summing over $L(m)$ as in (1). If an 
$o \in H(m)$ is actually not assigned to the location area $m$, then $y_{m,i} = 0$ for any $i$ 
with $z_{o,i} = 1$ and therefore $W_{m,o,p,i,j} = 0$. Therefore the summation over $H(m)$ 
reduces in fact to the summation over those $o$ which are assigned to $m$. For these, 
$y_{m,i} = 1$ holds for any $i$ with $z_{o,i} = 1$ and so $W_{m,o,p,i,j} = Z_{o,p,i,j}$ due to (9), 
(10), (15) and (16). 
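
The statement that $W_{m,o,p,i,j} = 1$ exactly when all four assignment variables equal 1 can be illustrated with a small enumeration. This is a sketch under an explicit assumption: the lower bound is taken in the form $W \ge z_{o,i} + z_{p,j} + y_{m,i} + y_{m,j} - 3$, and the matching upper bound $4W \le z_{o,i} + z_{p,j} + y_{m,i} + y_{m,j}$ is a standard linearization that is assumed here for illustration and may differ in form from the relation (15) actually used. The function name `feasible_w` is hypothetical.

```python
from itertools import product


def feasible_w(z_oi, z_pj, y_mi, y_mj):
    """Return the values of W in {0, 1} allowed by two linear constraints.

    Upper bound (assumed form): 4*W <= z_oi + z_pj + y_mi + y_mj
    Lower bound:                W >= z_oi + z_pj + y_mi + y_mj - 3
    """
    total = z_oi + z_pj + y_mi + y_mj
    return [w for w in (0, 1) if 4 * w <= total and w >= total - 3]


for bits in product([0, 1], repeat=4):
    # The only feasible W equals the product of the four binary variables.
    expected = int(all(bits))
    assert feasible_w(*bits) == [expected]
```

The same pattern with two variables instead of four gives the pairwise products needed for $X_{k,l,i,j}$, $Y_{m,n,i,j}$ and $Z_{o,p,i,j}$.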

In practice the model loses some of its complexity because some of the variables 
$x_{k,i}$, $y_{m,i}$ and $z_{o,i}$ might be assigned a fixed value by the planner. The BTSs near 
a BSC might be permanently assigned to that BSC, and similarly some BSCs might 
be permanently connected to certain MSCs. 

Integration of Further Network Design Criteria into the Model. The 
model given above does not contain various technical conditions and design 
principles which in practice influence the network structure. In the following we will 
explain how to integrate some of these into the model. 

a. The contribution of the $B_{m,n,i,j}$ in (1) shows that the design of more location 
areas requires more resources at the MSC. Therefore, although in our model we 
determine a number of location areas per MSC, i.e. the number of elements of 
$M(k)$, an optimal solution of the model will assign all the BTSs in one MSC 
region to one location area in order to reduce the resources required at the MSC. 
This can be prevented by predefining a minimum ratio $0 < q < 1$ of BTSs per 
MSC region that have to be assigned to each location area, i.e. 

$$\sum_{i=1}^{N} y_{m,i} \ge q \sum_{i=1}^{N} x_{k,i} \quad \forall k \in \{1,\dots,M\},\ \forall m \in M(k).$$

b. In practice there are capacity constraints on the BSCs and MSCs. Suppose 
for example there is a limit on the total traffic that can be handled by a BSC. 
Let $C$ be the limit and $T_i$ the average traffic generated at BTS $i$. Then we add 
the following inequality to our model 

$$\sum_{i=1}^{N} z_{o,i}\,T_i \le C,$$

valid for any BSC region $o$. Similarly one could limit the size of a location area 
in order to reduce the load on the broadcast channels. 

c. In section 3 we supposed an equal subscriber growth $g$ in all MSC regions. 
Assume that one instead expects a different subscriber growth $g(k)$ in every MSC 
region $k$. We assume $\min_{k \in \{1,\dots,M\}} g(k) = g_1$ and set $h_k = g(k)/g_1$. Then we 
substitute $g$ by $g_1 h_k$ in (1) for each $k \in \{1,\dots,M\}$ and choose the variables 
$x_{k,i}$, $y_{m,i}$ and $z_{o,i}$ such that $g_1$ can be chosen maximally large. 

d. The introduction of new services (for example Intelligent Network services) 
might require additional resources at the MSCs. If the resource requirements 
do not depend on subscriber mobility, but only on subscriber number and 
profile, the resource requirements can be integrated in the functions $a_{k,i}$. If they 
are directly related to any of the three mobility scenarios expressed by the 
variables $X_{k,l,i,j}$, $Y_{m,n,i,j}$ or $Z_{o,p,i,j}$, the resource requirements can be integrated in 
the functions $A_{k,l,i,j}$, $B_{m,n,i,j}$ or $C_{o,p,i,j}$. If they are related to a subscriber 
behaviour not expressed by any of the variables (5) - (10), other variables have to 
be introduced. 



4 Optimization of the Network Topology 

For large numbers of network elements it is impossible to solve the model given 
in section 3 exactly even if we suppose that $a_{k,i}(x)$, $A_{k,l,i,j}(x)$, $B_{m,n,i,j}(x)$ and 
$C_{o,p,i,j}(x)$ are linear functions in $x$. 

In practice the model is often neither needed in its full functionality nor applied 
to the whole network. In the example described in section 2, the planner will 
not try to change the whole network, but only the areas at the MSC region 
borders. Moreover among the different handover/location update scenarios there 
is generally one which requires significantly more resources at the MSC than the 
other scenarios. In the given example these might be the handovers and location 
updates at the MSC region borders. 

In a first optimization step using these assumptions we neglect the other mobility 
scenarios and set all $y_{m,i}$ and $z_{o,i}$ equal to zero. In order to obtain a complete 
system of linear equations we approximate the $a_{k,i}(x)$ and $A_{k,l,i,j}(x)$ by linear 
functions $t_{k,i}(x)$ and $T_{k,l,i,j}(x)$. If a good approximation by one linear function is 
not possible, the procedure is run with several linear functions which "envelope" 
the original function. The various solutions $g$, $x_{k,i}$, $y_{m,i}$, $z_{o,i}$, $X_{k,l,i,j}$, $Y_{m,n,i,j}$ and 
$Z_{o,p,i,j}$ found in this way are checked for their compatibility with the equations 
(1) for the original function. The compatible solution with the largest $g$ is taken 
as the final solution. 

In the following we suppose that approximation by a linear function is possible. 
Writing $g = 1/h$ we can reformulate the problem as follows: 

Choose the variables $x_{k,i}$ such that $h$ can be chosen as small as possible 
and (2), (5), (6), (18) and (19) hold, where: 

$$(1+h)\Bigl(\sum_{i=1}^{N} t_{k,i}(S)\,x_{k,i} + \sum_{\substack{l=1\\ l\ne k}}^{M}\sum_{i=1}^{N}\sum_{j\in N(i)} T_{k,l,i,j}(S)\,X_{k,l,i,j}\Bigr) \le h\,L_k \quad \forall k \in \{1,\dots,M\}, \qquad (18)$$

$$0 < h < \infty. \qquad (19)$$

The above model was applied to several network topologies of existing networks. 
In each of the examples adjacent border areas of two or three MSC regions were 
considered as the region to be restructured. The values for $a_{k,i}(x)$ and $A_{k,l,i,j}(x)$ 
were modeled on the basis of traffic measurements at the MSCs and BSCs/BTSs. 
LP software was applied to problem sizes up to 300 BTSs and 3 MSCs (see [1]). 
In most cases the optimization significantly improved the performance data of 
the MSCs. In several test cases excessive requirements of MSC resources were 
due to roads crossing the border between two MSC regions several times in a 
zig-zag movement. The application of the optimization procedure gave a new 
network design which reduced the number of times the roads crossed the MSC 
region borders. The running times on a Pentium PC were several minutes. 
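
The effect described above can be reproduced on a toy instance. The following sketch is an illustration only and does not use LP software: it enumerates all assignments of four BTSs to two MSCs by brute force. It assumes that the loads are linear and homogeneous in the subscriber number, so that for a fixed assignment the largest admissible growth is $g = \min_k L_k / \mathrm{load}_k(S) - 1$, which is what (18) expresses with $g = 1/h$. The network, the load values and the names `max_growth`, `call_load`, `handover_load` and `capacity` are hypothetical.

```python
from itertools import product


def max_growth(assignment, call_load, handover_load, neighbors, capacity):
    """Largest g with (1 + g) * load_k <= L_k for every MSC k.

    assignment[i] is the MSC region of BTS i. Loads are assumed linear and
    homogeneous in the subscriber number, so growth scales them by (1 + g).
    """
    load = [0.0] * len(capacity)
    for i, k in enumerate(assignment):
        load[k] += call_load[i]              # call processing term
        for j in neighbors[i]:
            if assignment[j] != k:           # neighbor in another MSC region
                load[k] += handover_load     # mobility term charged to MSC k
    ratios = [capacity[k] / load[k] for k in range(len(capacity)) if load[k] > 0]
    return min(ratios) - 1.0


# Hypothetical toy network: four BTSs along a road, two MSCs.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
call_load = [10.0, 10.0, 10.0, 10.0]
handover_load = 4.0
capacity = [40.0, 40.0]

best_g, best_assignment = float("-inf"), None
for assignment in product(range(len(capacity)), repeat=len(call_load)):
    g = max_growth(assignment, call_load, handover_load, neighbors, capacity)
    if g > best_g:                           # keep the first best assignment
        best_g, best_assignment = g, assignment

print(best_assignment, round(best_g, 3))
```

In this toy instance the zig-zag assignment (0, 1, 0, 1) crosses the region border three times and allows a growth of only 0.25, whereas the assignment with a single border crossing allows roughly 0.667. Brute force is feasible only for very small instances; the number of assignments grows as $M^N$.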

In a second step one could apply the model in section 3 to each of the MSC 
regions separately and redefine only the BSC regions while neglecting the location 
areas. To this end we require (1) only for the region $k$ actually considered and set 
all $x_{k,i} = 1$ within this region. We set all $y_{m,i}$ equal to zero and substitute the 
equations (2) - (4) by 

$$\sum_{m \in M(k)} \sum_{o \in L(m)} z_{o,i} = 1.$$

In none of the test scenarios did the redesigned network contain disconnected 
BSC or MSC regions. Even if the modeling of the problem does not explicitly 
exclude this, the objective function excluded such topologies because of the re- 
sources of the MSC that would be required for mobility management. 
In practice optimization problems are very often of a local nature and the amount 
of resources required at the MSC by a certain mobility scenario dominates the 
amount of resources required by the others. This justifies two simplifications. On 
the one hand we split a bigger problem into numerically smaller problems by 
optimizing different areas successively or even iteratively. On the other hand 
we can divide the optimization task into several steps by minimizing at each 
step the resources required by a certain scenario. This could also be done in 
iterative loops. In this way we obtain many small subproblems, which can all 
be modeled by the model presented here and solved by conventional LP software. 



5 Conclusions 

The paper provides an optimization procedure for the use of MSC resources 
in a GSM network. It can be easily adapted to other technologies like UMTS 
which have similar network topologies. Mathematically the problem is treated 
by straightforward linear programming techniques with a subsequent partition 
of the problems into several subproblems. 

The importance of this contribution consists mainly in its practical application 
by the fixed network planner. The very complex network design task will not 
only be completed much faster than without tool support, but the result will 
provide essential improvements in terms of resource utilization and therefore 
lead to considerable investment savings for the network operator. 
Abstract. This paper examines a distributed system where users employ 
a mobile software agent to perform a sequence of tasks associated 
with different network nodes. Each operation can be carried out either 
locally or remotely, and may or may not involve moving the agent from 
one node to another; in general, all these options have different costs. The 
problem is to determine the optimal agent allocation policy, for a given 
cost structure and pattern of user demand. The methodology adopted is 
that of Markov Decision Processes. Two numerical approaches are pre- 
sented for the general problem, and a closed-form solution is obtained in 
a non-trivial special case. 



1 Introduction 

Consider a distributed system consisting of N nodes connected by a network. 
Certain user operations, which typically involve accessing and/or processing of 
information, are carried out by a mobile software agent. When the required 
information and the agent are at the same node, the operation is said to be 
‘local’. However, the agent is also able to 

— perform a ‘remote’ operation at one node, while residing at another; 

— move from one node to another. 

The average costs associated with these activities are specified by two 
matrices. The first is 

$$C = (c_{i,j})_{i,j=1}^{N}, \qquad (1)$$

where $c_{i,j}$ is the average cost of performing an operation at node $j$ when the 
agent resides at node $i$; in particular, $c_{i,i}$ is the average cost of a local operation 
at node $i$. The second matrix is 

$$D = (d_{i,j})_{i,j=1}^{N}, \qquad (2)$$

where $d_{i,j}$ is the average cost of moving the agent from node $i$ to node $j$; it is 
natural to assume that $d_{i,i} = 0$ for all $i$. 

Both types of costs would normally, but not necessarily, be expressed in terms 
of delay times. 
Suppose further that the pattern of demand can be described by a Markov 
chain with $N$ states and transition probability matrix 

$$Q = (q_{i,j})_{i,j=1}^{N}. \qquad (3)$$

That is, the probability that the next operation involves information stored at 
node j, given that the last was associated with node i, is equal to qij, regardless 
of past history. The intervals between consecutive requests are irrelevant for the 
purpose of this study; they are assumed to be of length 1, to enable us to work 
in a discrete time setting. 

The questions that we wish to address in this context are concerned with the 
optimal dynamic positioning of the agent. Clearly, there are trade-offs between 
the benefits of performing operations locally, and the costs of moving the agent. 
Moreover, the evaluation of those trade-offs depends on the criterion of optimiza- 
tion. Thus, if the aim is to minimize the cost of satisfying a single request, then 
it is enough to compare the (at most N ) actions that are possible at the time, 
and choose the cheapest. However, one is normally interested in minimizing the 
total cost incurred over some event horizon which may be finite or infinite and 
may or may not involve a discount factor. 

We shall tackle this problem by using the tools of Dynamic Programming, and 
in particular those of Markov Decision Processes [2,3,7,8]. This methodology has 
been employed extensively in a variety of contexts. To mention a few examples, 
Kim and Van Oyen recently studied the optimal scheduling of a shared server 
in a production line. Hordijk and Koole [4] considered a problem of optimal 
control in a multiprocessor system with non-Poisson arrivals. A collection of 
papers by eminent authors in the area of optimization can be found in part VII 
of 0. 

As far as we are aware, the problem described here has not been addressed 
before. The motivation for the present study came from a real, albeit experimen- 
tal system whose aim was to provide users with personal information gathering 
agents. 

Section 2 considers the finite horizon optimization. Finding the best policy 
in this case reduces to solving a set of recurrence relations. The infinite horizon 
discounted optimization is examined in section 3; an optimal stationary policy 
can be found by applying a ‘policy improvement algorithm’. These developments 
are quite straightforward. Some numerical results are presented, showing that 
the optimal policy is not always easily predictable. Run times for the solution 
algorithms are mentioned. Next, an important and non-trivial special case is 
solved analytically, leading to an explicit, closed form characterization of the 
optimal policy. This result, which is perhaps the central contribution of the 
paper, is presented in section 4. A couple of possible generalizations are discussed 
in the conclusion. 

2 Finite Horizon Optimization 

Requests arrive into the system at times $t = 1, 2, \dots$. The system state at time 
$t$ is described by a pair of integers, $(i,j)$, where $i$ is the current location of the 
agent, and $j$ is the location of the information to be accessed (both $i$ and $j$ are 
node indices in the range $1, 2, \dots, N$). There are $N$ mutually exclusive actions 
that can be taken in any state: $a_1, a_2, \dots, a_N$. Action $a_k$ consists of moving 
the agent from its current location to node $k$ and then performing the required 
operation. According to (1) and (2), the average cost incurred when the current 
state is $(i,j)$ and action $a_k$ is taken, is equal to $d_{i,k} + c_{k,j}$. 

Denote by $V_{i,j}(n)$ the minimal expected total cost incurred in satisfying $n$ 
consecutive requests, given that the system state at the time of arrival of the 
first of them is $(i,j)$. The cost of an action performed $m$ steps into the future is 
discounted by a factor $\alpha^m$ ($m = 1, 2, \dots, n-1$; $0 \le \alpha \le 1$). Setting $\alpha = 0$ implies 
that all future costs are disregarded; only the current operation is important. 
When $\alpha = 1$, the cost of a future action, no matter how distant, carries the same 
weight as the current one. 

Any sequence of actions which achieves the minimal cost $V_{i,j}(n)$ constitutes 
an ‘optimal policy’ with respect to the initial state $(i,j)$, cost matrices $C$ and 
$D$, event horizon $n$, and discount factor $\alpha$. 

Suppose that the action taken in state $(i,j)$ is $a_k$. If the next request is for 
information stored at node $l$ (which according to (3) occurs with probability 
$q_{j,l}$), then the next state will be $(k,l)$, and the minimal cost of the next $n-1$ 
requests will be $\alpha V_{k,l}(n-1)$. Hence, the quantities $V_{i,j}(n)$ satisfy the following 
recurrence relations: 

$$V_{i,j}(n) = \min_{1 \le k \le N} \Bigl[ d_{i,k} + c_{k,j} + \alpha \sum_{l=1}^{N} q_{j,l}\, V_{k,l}(n-1) \Bigr]. \qquad (4)$$
Thus, starting from the initial values $V_{i,j}(0) = 0$ ($i, j = 1, 2, \dots, N$), one can 
compute $V_{i,j}(n)$ in $n$ steps. Note that, if $V_{i,j}(m-1)$ has already been computed 
for some $m$ ($1 \le m \le n$), and for all $i$ and $j$, then the complexity of computing 
$V_{i,j}(m)$, for a particular state $(i,j)$, is on the order of $O(N^2)$. The best action 
to take in that state, and for that $m$, is indicated by the value of $k$ that achieves 
the minimum in the right-hand side of (4). There may be several equally good 
actions. Since there are $N^2$ states altogether, the overall computational 
complexity of solving (4) and determining the optimal agent allocation policy over 
a finite event horizon of size $n$, is on the order of $O(N^4 n)$. 
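
The recurrence can be sketched in a few lines of code. The following is an illustrative implementation, not the one used to obtain the results reported later; the function name `finite_horizon` and its return convention are assumptions. As an input it uses the parameters of Example 1 in section 3.1 (5 nodes, local cost 1, remote cost 5, move cost 10, probability 0.9 that the next request targets the same node and 0.025 for each other node, no discounting).

```python
def finite_horizon(c, d, q, alpha, n):
    """Return (V, policy) for an event horizon of n requests.

    V[i][j] is the minimal expected cost of n requests starting in state (i, j);
    policy[i][j] is the index k of an action a_k achieving that minimum.
    """
    N = len(c)
    V = [[0.0] * N for _ in range(N)]            # initial values V(0) = 0
    policy = [[0] * N for _ in range(N)]
    for _ in range(n):
        # expected[k][j] = sum_l q[j][l] * V[k][l]; it does not depend on i,
        # so it is computed once per step and shared by all agent positions.
        expected = [[sum(q[j][l] * V[k][l] for l in range(N)) for j in range(N)]
                    for k in range(N)]
        new_V = [[0.0] * N for _ in range(N)]
        for i in range(N):
            for j in range(N):
                costs = [d[i][k] + c[k][j] + alpha * expected[k][j]
                         for k in range(N)]
                best = min(range(N), key=costs.__getitem__)
                new_V[i][j], policy[i][j] = costs[best], best
        V = new_V
    return V, policy


# Parameters of Example 1 (section 3.1).
N = 5
c = [[1.0 if i == j else 5.0 for j in range(N)] for i in range(N)]
d = [[0.0 if i == j else 10.0 for j in range(N)] for i in range(N)]
q = [[0.9 if i == j else 0.025 for j in range(N)] for i in range(N)]

for horizon in (1, 2, 3):
    V, policy = finite_horizon(c, d, q, alpha=1.0, n=horizon)
    # agent at node 0, request at node 1: which node should the agent use?
    print(horizon, policy[0][1], round(V[0][1], 4))
```

For the state with the agent at node 0 and the request at node 1, the sketch keeps the agent in place for horizons 1 and 2 and moves it to the request node for horizon 3. Sharing the inner sums across agent positions is a design choice of this sketch; it lowers the work per step below the $O(N^4)$ bound of a direct evaluation.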

3 Infinite Horizon Optimization 

If the discount factor $\alpha$ is strictly less than 1, it makes sense to consider the 
total minimal expected cost, $V_{i,j}$, of satisfying the current request and all future 
requests, given that the current state is $(i,j)$. That cost is of course infinite 
when $\alpha = 1$, but it is finite when $\alpha < 1$. Indeed, in the latter case it is known 
(see DP), that under certain rather weak conditions, $V_{i,j}(n) \to V_{i,j}$ when 
$n \to \infty$. Moreover, there exists an optimal stationary policy (i.e. one whose actions 
depend only on the current state) which satisfies the Dynamic Programming 
equations. 



An argument similar to the one preceding (4) leads to the following equation 
for $V_{i,j}$: 

$$V_{i,j} = \min_{1 \le k \le N} \Bigl[ d_{i,k} + c_{k,j} + \alpha \sum_{l=1}^{N} q_{j,l}\, V_{k,l} \Bigr]. \qquad (5)$$



The optimal policy (i.e. the best action in any given state) is specified by the 
value(s) of $k$ that achieve the minimum in the right-hand side of (5). 

Either an exact or an approximate solution of the optimality equation (5) 
can be obtained by applying the ‘policy improvement’ algorithm (see Dreyfus 
and Law). This iterative algorithm can be applied to the present problem as 
follows. 

Step 1. Start by making an initial guess about the optimal policy, i.e. 
construct an initial mapping, $f(\cdot,\cdot)$, from system states to action indices. In the 
absence of other information, one could choose the action that minimizes the 
cost of the current operation only: 

$$f(i,j) = \{\, k \mid d_{i,k} + c_{k,j} \le d_{i,s} + c_{s,j},\ s \ne k \,\}.$$

Step 2. Treating the current guess, $f(\cdot,\cdot)$, as the optimal stationary 
policy, compute the corresponding discounted costs, $V_{i,j}$, by solving a set of 
$N^2$ simultaneous linear equations: 

$$V_{i,j} = d_{i,f(i,j)} + c_{f(i,j),j} + \alpha \sum_{l=1}^{N} q_{j,l}\, V_{f(i,j),l}. \qquad (6)$$






Step 3. Try to ‘improve’ policy $f$. For every state $(i,j)$ find the action, 
$k^*(i,j)$, which achieves the minimum value in 

$$\min_{1 \le k \le N} \Bigl[ d_{i,k} + c_{k,j} + \alpha \sum_{l=1}^{N} q_{j,l}\, V_{k,l} \Bigr]. \qquad (7)$$

In other words, minimize the total cost in state $(i,j)$, assuming that after the 
current operation, policy $f$ will be used. 

Step 4. If $k^*(i,j) = f(i,j)$ for all $(i,j)$, then the policy $f(\cdot,\cdot)$ cannot be 
improved; it is optimal. Otherwise, the next guess for the optimal policy is 
$f(\cdot,\cdot) = k^*(\cdot,\cdot)$. Repeat from step 2. 

The computational complexity of this algorithm is determined by the 
complexity of each iteration, which is dominated by step 2, and by the number of 
iterations. There may be better ways of solving the $N^2$ equations in step 2 than 
Gaussian elimination. Those equations can be represented in matrix form as 

$$V = A + \alpha Q F(V), \qquad (8)$$

where $V$ is the matrix of the $N^2$ unknowns, $A$ is the matrix of single-request 
costs in the right-hand side of (6), $Q$ is the request transition matrix (3), and 
$F(V)$ is an appropriate rearrangement of the elements of $V$. There is an 
obvious iterative schema for solving (8): start with an initial approximation to 
$V$, e.g. $V_0 = A$, then at the $n$th iteration compute 

$$V_n = A + \alpha Q F(V_{n-1}).$$

Since $Q$ is a stochastic matrix and $\alpha < 1$, this schema converges geometrically. 
Moreover, the smaller the value of $\alpha$, the faster the convergence. The iterative 
solution is likely to be more efficient than Gaussian elimination, unless $\alpha$ is very 
close to 1. 
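
Steps 1 to 4, with step 2 carried out by the iterative schema rather than by Gaussian elimination, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind the reported run times: the function name `policy_improvement`, the stopping tolerance `tol` and the improvement threshold `eps` are hypothetical, and a policy is only changed on a strict improvement so that ties cannot cause cycling.

```python
def policy_improvement(c, d, q, alpha, tol=1e-10, eps=1e-6):
    """Policy improvement for the discounted infinite horizon (alpha < 1).

    Returns (V, f): the discounted costs and a stationary policy f[i][j] = k.
    """
    N = len(c)
    states = [(i, j) for i in range(N) for j in range(N)]

    def action_cost(V, i, j, k):
        # cost of action a_k in state (i, j) when V describes the future
        return d[i][k] + c[k][j] + alpha * sum(q[j][l] * V[k][l] for l in range(N))

    def evaluate(f):
        # Step 2: fixed-point iteration, starting from the single-request costs
        V = [[d[i][f[i][j]] + c[f[i][j]][j] for j in range(N)] for i in range(N)]
        while True:
            new_V = [[action_cost(V, i, j, f[i][j]) for j in range(N)]
                     for i in range(N)]
            gap = max(abs(new_V[i][j] - V[i][j]) for i, j in states)
            V = new_V
            if gap < tol:
                return V

    # Step 1: initial guess minimizing the cost of the current operation only
    f = [[min(range(N), key=lambda k: d[i][k] + c[k][j]) for j in range(N)]
         for i in range(N)]
    while True:
        V = evaluate(f)
        changed = False
        for i, j in states:
            best = min(range(N), key=lambda k: action_cost(V, i, j, k))
            # Step 3: switch only if the best action is strictly cheaper
            if action_cost(V, i, j, best) < action_cost(V, i, j, f[i][j]) - eps:
                f[i][j], changed = best, True
        if not changed:                      # Step 4: policy cannot be improved
            return V, f


# Illustrative input: the cost structure of Example 1 with discounting.
N = 5
c = [[1.0 if i == j else 5.0 for j in range(N)] for i in range(N)]
d = [[0.0 if i == j else 10.0 for j in range(N)] for i in range(N)]
q = [[0.9 if i == j else 0.025 for j in range(N)] for i in range(N)]
V, f = policy_improvement(c, d, q, alpha=0.9)
print(f[0][1], round(V[0][1], 3))
```

The number of sweeps needed in `evaluate` grows as $\alpha$ approaches 1, which is the regime where a direct linear solver may be preferable.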

3.1 Numerical Results 

Several models were solved using the algorithms described in this section, in 
order to illustrate the influence of various parameters on the optimal allocation 
policy. 

Example 1. A finite horizon optimization is carried out for a network with 5 
nodes. The cost of any local operation is 1, and the cost of any remote operation 
is 5. The cost of moving the agent from one node to another is 10. Requests 
move among the nodes according to the following transition probabilities: 

$$q_{i,j} = \begin{cases} 0.9 & \text{if } i = j \\ 0.025 & \text{if } i \ne j \end{cases}$$

The discount factor is $\alpha = 1$ (i.e., no discounting). 

This example illustrates how, despite the fact that $V_{i,j}(n) \to \infty$ when $n \to \infty$, 
the optimal policy converges quite quickly. For $n = 1, 2$, the best action is to 
leave the agent where it is; if $n \ge 3$, it is best to move the agent to the node 
of the request. Note that this refers to what is best in the case of the current 
request; the optimal policy is not stationary. Suppose, for instance, that $n = 5$. 
When serving the first three requests, the agent should move (if not already at 
the right node); after that, since the remaining horizon is 2 or less, it should 
remain static. 

Example 2. This concerns a discounted infinite horizon optimization in 
a simple 8-node network. As in the previous example, the cost of any local 
operation is 1, and the cost of any remote operation is 5; the cost of moving the 
agent from any node to any other node is now 50. The pattern of requests is 
such that node 8 is more popular than any other node: 

Now there is an optimal stationary policy, which depends on the value of the 
discount factor. When $\alpha$ is less than about 0.975, the optimal policy is to keep the 
agent static at whichever node it is placed originally, regardless of the location 
of the request. For larger values of $\alpha$, the policy becomes: ‘if the request is at 
node 8 and the agent is not, move the agent to node 8; otherwise keep it static’. 

Example 3. Consider a more complex network of 16 nodes, grouped into 4 
clusters of 4 nodes in each cluster (figure 1). 
Fig. 1. A 16-node network with 4 clusters 



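The cost structure in this network is as follows. 

$$c_{i,j} = \begin{cases} 1 & \text{if } i = j \\ 5 & \text{if } i \text{ and } j \text{ are at different nodes of the same cluster} \\ 10 & \text{if } i \text{ and } j \text{ are in different clusters} \end{cases}$$

$$d_{i,j} = \begin{cases} 0 & \text{if } i = j \\ 25 & \text{if } i \text{ and } j \text{ are at different nodes of the same cluster} \\ 50 & \text{if } i \text{ and } j \text{ are in different clusters} \end{cases}$$

The request transition probabilities are given by 

$$q_{i,j} = \begin{cases} 0.235 & \text{if } i \text{ and } j \text{ are in the same cluster} \\ 0.005 & \text{if } i \text{ and } j \text{ are in different clusters} \end{cases}$$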

A discounted infinite horizon optimization again leads to different policies for
different values of the discount factor. When $\alpha$ is less than approximately 0.95,
the best policy is to keep the agent static. For larger values of $\alpha$, the optimal
policy has the following form: if the agent and the request are at nodes belonging
to the same cluster, keep the agent at its current node; as soon as the request
moves to a node in a different cluster, move the agent to that node.
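The discounted optimization for Example 3 can be sketched as a plain value iteration. The sketch below is not the authors' program: it assumes that a state is the pair (agent node, request node), that an action chooses the node k from which the request is served at cost $d_{ik} + c_{kj}$, and that nodes 0-3, 4-7, 8-11 and 12-15 form the four clusters. All function names are hypothetical.

```python
# Sketch only: the state/action model and the node numbering are assumptions.
N_NODES = 16
CLUSTER_SIZE = 4


def cluster(node):
    """Index of the cluster that a node belongs to (nodes 0-3, 4-7, ...)."""
    return node // CLUSTER_SIZE


def op_cost(k, j):
    """Cost c_kj of serving a request at node j with the agent at node k."""
    if k == j:
        return 1.0
    return 5.0 if cluster(k) == cluster(j) else 10.0


def move_cost(i, k):
    """Cost d_ik of moving the agent from node i to node k."""
    if i == k:
        return 0.0
    return 25.0 if cluster(i) == cluster(k) else 50.0


def request_prob(j, j_next):
    """Probability that the next request is at j_next, given the current one at j."""
    return 0.235 if cluster(j) == cluster(j_next) else 0.005


def value_iteration(alpha, sweeps):
    """Return (V, policy); policy[i][j] is the node the agent serves from."""
    nodes = range(N_NODES)
    V = [[0.0] * N_NODES for _ in nodes]
    policy = [[i] * N_NODES for i in nodes]
    for _ in range(sweeps):
        # cont[k][j]: expected next-step value with the agent at k, request now at j.
        cont = [[sum(request_prob(j, j2) * V[k][j2] for j2 in nodes)
                 for j in nodes] for k in nodes]
        new_V = [[0.0] * N_NODES for _ in nodes]
        for i in nodes:
            for j in nodes:
                # Start from "stay put" so that ties are resolved without moving.
                best_k = i
                best = op_cost(i, j) + alpha * cont[i][j]
                for k in nodes:
                    cost = move_cost(i, k) + op_cost(k, j) + alpha * cont[k][j]
                    if cost < best - 1e-12:
                        best_k, best = k, cost
                new_V[i][j] = best
                policy[i][j] = best_k
        V = new_V
    return V, policy


if __name__ == "__main__":
    V, policy = value_iteration(alpha=0.5, sweeps=60)
    static = all(policy[i][j] == i for i in range(N_NODES) for j in range(N_NODES))
    print("static policy optimal for alpha = 0.5:", static)
```

For a small discount factor such as 0.5 the static policy must come out, because the discounted savings of at most 9 per step can never repay a move that costs at least 25. Whether the sketch reproduces the threshold of about 0.95 quoted above depends on how closely the assumed model matches the one used by the authors.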

Complexity of the Solutions. Determining the optimal policy over a finite
horizon is simple and fast. To give some idea of the times involved, the 8-node
system in example 2 can be solved for a horizon of n = 20 in 0.02 seconds on
a Pentium 350 MHz processor. For the 16-node network in example 3, a finite
horizon problem with n = 20 is solved in 0.3 seconds. In comparison, a direct
application of the policy improvement algorithm took 0.33 seconds for example
2, and 1.3 seconds for example 3. This suggests that in some cases it may be
numerically more efficient to treat the infinite horizon optimization as a limit,
$n \to \infty$, of the finite horizon one. However, that aspect of the solution requires
further study.
4 Analysis of a Special Case



Here we examine the optimal allocation policies in a system where both the cost 
structure and the pattern of demand are symmetric. More precisely, suppose 
that the cost of performing an operation depends on whether the latter is local 
or remote, but does not depend on the particular locations of information and 
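agent. In other words, the matrix C has the following form:

$$c_{ij} = \begin{cases} c_0 & \text{if } i = j \\ c_1 & \text{if } i \ne j \end{cases} \qquad (9)$$

Similarly, assume that the cost of moving the agent does not depend on the
source and destination:

$$d_{ij} = \begin{cases} 0 & \text{if } i = j \\ d & \text{if } i \ne j \end{cases} \qquad (10)$$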



The target of any request is, with a given probability, at the same node as
that of the previous request; otherwise it is equally likely to be at any other
node:

$$q_{ij} = \begin{cases} q & \text{if } i = j \\ \dfrac{1-q}{N-1} & \text{if } i \ne j \end{cases} \qquad (11)$$

This set of assumptions is quite reasonable. Of course, the problem is
interesting only when the constants $c_0$, $c_1$, $d$ and $q$ satisfy: $0 < c_0 < c_1$; $d > 0$;
$0 < q < 1$. Otherwise the optimal policy is obvious.

In this model, all system states of the form $(i, i)$ are equivalent. Indeed, as
long as the agent and the target of the request are at the same node, the identity
of that node does not matter. Similarly, all states of the form $(i, j)$, with $i \ne j$,
are equivalent. Hence, we can reduce the $N^2$ system states to two, 0 and 1.

State 0: the agent and the target of the request are at the same node;

State 1: the agent and the target of the request are at different nodes.

Denote the corresponding infinite horizon minimal costs by $V_0$ and $V_1$,
respectively.

In state 0, there are two distinguishable actions:

($l$) leave the agent in place and perform the operation locally;

($m$) move the agent to another node (it does not matter which) and perform the
operation remotely.

In state 1, there are three distinguishable actions:

($l$) leave the agent in place and perform the operation remotely;

($m_1$) move the agent to the node required by the request and perform the
operation locally;

($m_2$) move the agent to another node and perform the operation remotely.

If action ($l$) is taken in state 0, or action ($m_1$) is taken in state 1, then the next
state will be 0 with probability $q$ and 1 with probability $1 - q$; the subsequent
discounted cost will be

$$v_0 = \alpha [q V_0 + (1 - q) V_1] . \qquad (12)$$



If action ($m$) is taken in state 0, or action ($l$) is taken in state 1, or action ($m_2$)
is taken in state 1, then the next state will be 0 with probability $(1 - q)/(N - 1)$
and 1 with probability $1 - (1 - q)/(N - 1)$. The subsequent discounted cost will
be

$$v_1 = \alpha \left[ \frac{1-q}{N-1} V_0 + \left( 1 - \frac{1-q}{N-1} \right) V_1 \right] . \qquad (13)$$

The optimality equations for $V_0$ and $V_1$ can now be written as follows:

$$V_0 = \min[\, c_0 + v_0 \,,\; d + c_1 + v_1 \,] \qquad (14)$$

$$V_1 = \min[\, c_1 + v_1 \,,\; d + c_0 + v_0 \,,\; d + c_1 + v_1 \,] \qquad (15)$$

We notice immediately that the third alternative in the right-hand side of
(15) is always more expensive than the first. Since action ($m_2$) taken in state 1
can never be part of an optimal policy, that alternative can be discarded and
(15) can be rewritten as

$$V_1 = \min[\, c_1 + v_1 \,,\; d + c_0 + v_0 \,] . \qquad (16)$$

Thus the optimal stationary policy is one of the following four policies.

Policy 1: Take action ($l$) in state 0 and action ($l$) in state 1.

Policy 2: Take action ($l$) in state 0 and action ($m_1$) in state 1.

Policy 3: Take action ($m$) in state 0 and action ($l$) in state 1.

Policy 4: Take action ($m$) in state 0 and action ($m_1$) in state 1.

In fact, two of these candidates can be eliminated from consideration. 

Proposition 1 Policies 3 and 4, where action ($m$) is taken in state 0, cannot
be optimal.

Although this is quite an intuitive result (why should one wish to move the
agent when it is in a position to perform a local operation?), it is not entirely
obvious. For instance, there are cases where policy 3 is better than policy 2.

Proof. Policy 4 is easily disposed of. Indeed, suppose that it is optimal. That
would mean that in the right-hand side of (14),

$$c_0 + v_0 > d + c_1 + v_1 ,$$

and in the right-hand side of (16),

$$c_1 + v_1 > d + c_0 + v_0 .$$

However, these two inequalities cannot hold simultaneously if $d > 0$.
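To spell out the step, add the two inequalities side by side:

$$(c_0 + v_0) + (c_1 + v_1) > (d + c_1 + v_1) + (d + c_0 + v_0) \quad\Longrightarrow\quad 0 > 2d ,$$

which contradicts $d > 0$.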

Next we show that policy 3 cannot be optimal either, because it is always 
inferior to policy 1. 

Under policy 1, the total costs satisfy the equations

$$V_0^{(1)} = c_0 + v_0 = c_0 + \alpha \left[ q V_0^{(1)} + (1 - q) V_1^{(1)} \right] , \qquad (17)$$

$$V_1^{(1)} = c_1 + v_1 = c_1 + \alpha \left[ \frac{1-q}{N-1} V_0^{(1)} + \left( 1 - \frac{1-q}{N-1} \right) V_1^{(1)} \right] . \qquad (18)$$

(The superscript in parentheses is used to indicate the policy). The solution of
these equations is given by

$$V_0^{(1)} = \frac{[\alpha(1-q) + (N-1)(1-\alpha)]\, c_0 + (N-1)\alpha(1-q)\, c_1}{(1-\alpha)[\alpha(1-q) + (N-1)(1-\alpha q)]} , \qquad (19)$$

$$V_1^{(1)} = \frac{\alpha(1-q)\, c_0 + (N-1)(1-\alpha q)\, c_1}{(1-\alpha)[\alpha(1-q) + (N-1)(1-\alpha q)]} . \qquad (20)$$

Under policy 3, the equations are

$$V_0^{(3)} = d + c_1 + \alpha \left[ \frac{1-q}{N-1} V_0^{(3)} + \left( 1 - \frac{1-q}{N-1} \right) V_1^{(3)} \right] , \qquad (21)$$

$$V_1^{(3)} = c_1 + \alpha \left[ \frac{1-q}{N-1} V_0^{(3)} + \left( 1 - \frac{1-q}{N-1} \right) V_1^{(3)} \right] . \qquad (22)$$

Their solution is

$$V_0^{(3)} = \frac{(N-1)\, c_1 + [\alpha(1-q) + (N-1)(1-\alpha)]\, d}{(N-1)(1-\alpha)} , \qquad (23)$$

$$V_1^{(3)} = \frac{(N-1)\, c_1 + \alpha(1-q)\, d}{(N-1)(1-\alpha)} . \qquad (24)$$

Suppose, for instance, that $V_1^{(3)} < V_1^{(1)}$. Substituting (24) and (20) into this
inequality and carrying out straightforward simplifications leads to

$$(N-1)\alpha(1-q)(c_0 - c_1) > \alpha(1-q)[(N-1)(1-\alpha q) + \alpha(1-q)]\, d .$$

However, this is impossible because the left-hand side is negative, whereas the
right-hand side is positive.

A similar contradiction is obtained by supposing that $V_0^{(3)} < V_0^{(1)}$. This
completes the proof of proposition 1.

Thus, the optimal policy for any parameter setting is either policy 1 or policy 
2. When the target of the request and the agent are at the same node, the latter 
should not move. Whether or not it is better to move the agent to the requested 
node when it is not there already, is indicated by the following. 



Proposition 2 Policy 1 is optimal if

$$\frac{c_1 - c_0}{d} < 1 - \alpha\, \frac{Nq - 1}{N - 1} . \qquad (25)$$

Otherwise policy 2 is optimal.

Proof. The cost equations for policy 2 are

$$V_0^{(2)} = c_0 + \alpha \left[ q V_0^{(2)} + (1 - q) V_1^{(2)} \right] , \qquad (26)$$

$$V_1^{(2)} = d + c_0 + \alpha \left[ q V_0^{(2)} + (1 - q) V_1^{(2)} \right] . \qquad (27)$$

Their solution is given by

$$V_0^{(2)} = \frac{c_0 + \alpha(1-q)\, d}{1 - \alpha} , \qquad (28)$$

$$V_1^{(2)} = \frac{c_0 + (1 - \alpha q)\, d}{1 - \alpha} . \qquad (29)$$

The proof is completed by comparing these expressions with the corresponding
ones under policy 1, (19) and (20), and carrying out simplifications. It is not
difficult to see that each of the inequalities $V_0^{(1)} < V_0^{(2)}$ and $V_1^{(1)} < V_1^{(2)}$ is
equivalent to the inequality (25), which establishes the proposition.

Proposition 2 quantifies the intuitive notion that if the 'gain' of performing
an operation locally rather than remotely is small relative to the cost of moving
the agent, then the latter should remain in a fixed position throughout; if that
relative gain is large, then the agent should follow the requests as they move from
node to node. Exactly how large the gain should be, is specified by the threshold
in the right-hand side of (25). Note that the latter is a non-increasing function
of $q$ and of $N$, since $(Nq - 1)/(N - 1)$ increases with both; it is also non-increasing
in $\alpha$ whenever $q \ge 1/N$. In particular, in a large network ($N \to \infty$), the relative
gain beyond which it is worth moving the agent is approximately equal to $1 - \alpha q$.
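The threshold rule of proposition 2 can be checked numerically against the closed-form costs. The following sketch simply evaluates (19), (20), (28), (29) and the right-hand side of (25); the function names and the parameter values in the usage part are illustrative assumptions, not taken from the paper.

```python
def policy1_costs(c0, c1, q, alpha, n):
    """Closed forms (19) and (20): costs in states 0 and 1 under policy 1."""
    e = alpha * (1 - q) + (n - 1) * (1 - alpha * q)
    v0 = ((alpha * (1 - q) + (n - 1) * (1 - alpha)) * c0
          + (n - 1) * alpha * (1 - q) * c1) / ((1 - alpha) * e)
    v1 = (alpha * (1 - q) * c0
          + (n - 1) * (1 - alpha * q) * c1) / ((1 - alpha) * e)
    return v0, v1


def policy2_costs(c0, d, q, alpha):
    """Closed forms (28) and (29): costs in states 0 and 1 under policy 2."""
    v0 = (c0 + alpha * (1 - q) * d) / (1 - alpha)
    v1 = (c0 + (1 - alpha * q) * d) / (1 - alpha)
    return v0, v1


def threshold(q, alpha, n):
    """Right-hand side of inequality (25)."""
    return 1 - alpha * (n * q - 1) / (n - 1)


def policy1_is_optimal(c0, c1, d, q, alpha, n):
    """Proposition 2: keep the agent static when the relative gain is small."""
    return (c1 - c0) / d < threshold(q, alpha, n)


if __name__ == "__main__":
    # Illustrative values only.
    c0, c1, d, q, alpha, n = 1.0, 5.0, 50.0, 0.9, 0.9, 8
    print("policy 1 costs:", policy1_costs(c0, c1, q, alpha, n))
    print("policy 2 costs:", policy2_costs(c0, d, q, alpha))
    print("static agent optimal:", policy1_is_optimal(c0, c1, d, q, alpha, n))
```

Comparing the two cost pairs directly and applying the threshold test should always give the same verdict, which is a convenient sanity check on the reconstructed formulas.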

In practice, the cost of moving the agent and then performing a local
operation is likely to be greater than the cost of a remote operation: $c_0 + d > c_1$. In
other words, the left-hand side of (25) is likely to be less than 1. On the other
hand, there are values of $q$ for which the right-hand side of (25) is bigger than 1
($q < 1/N$). In those cases, the best policy is to keep the agent static, regardless
of $\alpha$.

5 Conclusion 

We have shown that the problem of deciding how to allocate a software agent in 
the presence of a stream of requests addressed at different nodes can be tackled 
by dynamic programming methods. For a general system, the optimal policy can 
be determined numerically in the context of either a finite or an infinite event 
horizon. In the important special case discussed in section 4, the optimal policy 
is characterized explicitly. 

The model described here can usefully be generalized in several directions. 
For example, there may be more than one agent available to serve requests. As 
far as a general numerical solution is concerned, the same methodology would 
apply, but with a considerably higher computational complexity. The state of a
system with $m$ agents is described by an $(m+1)$-tuple, $(i_1, i_2, \ldots, i_m, j)$, where
$i_s$ is the position of agent $s$, and $j$ is the target of the current request. The number
of actions possible in each state is also much larger: any of the agents can move 
to any node. However, in the special case of section 4, a reasonable conjecture is 
that a result similar to proposition 2 will continue to apply, perhaps with a larger 
threshold (the more agents are scattered among the nodes, the less advantageous 
it appears to move any of them). 



Another possible generalization is to allow more than one type of user,
generating different patterns of requests. This can also be handled by substantially
increasing the state space. Alternatively, one could approximate the merged
sequence of all users' requests by a single Markov chain, thus reducing the problem
to the present one.
Abstract. Conventional algorithms for the steady-state analysis of Markov
regenerative models suffer from high computational costs which are
caused by densely populated matrices. In this paper a new algorithm is
suggested which avoids computing these matrices explicitly. Instead, a
two-stage iteration scheme is used. An extended version of uniformization
is applied as a subalgorithm to compute the required transient quantities
"on the fly". The algorithm is formulated in terms of stochastic Petri
nets. A detailed example illustrates the proposed concepts.



1 Introduction 

We consider the stationary analysis of Markov regenerative models. We assume 
that the model is specified by a stochastic Petri net (SPN) [Tlj, although other 
modeling formalisms would be possible as well. Certain attention has been paid 
to the numerical analysis of SPNs in which the firing times can have a general 
distribution (general transitions). Transitions with an exponentially distributed 
firing time are referred to as exponential transitions. Under the restriction that 
the general transitions are mutually exclusive, the underlying stochastic process 
is a Markov regenerative process (MRP) and the corresponding theory can be 
used for the analysis wm- A special case is the class of deterministic and 
stochastic Petri nets (DSPNs) [15] . where the general transitions have a deter- 
ministic firing time. 

In the stationary case, the analysis leads to the construction of an embedded 
discrete-time Markov chain (EMC). In the standard analysis approach, first the 
stochastic matrix P of the EMC and also a conversion matrix C is computed. 
P is used to compute the stationary solution of the EMC and C is used to 
convert the solution to the fraction of time the MRP spends in its states between 
two time instants at which the EMC is embedded. The latter is equal to the 
stationary solution of the MRP. Solution components based on this algorithm 
were implemented in the tools DSPNexpress m and UltraSAN [19] (restricted 
to DSPNs in both cases) and in TimeNET (for SPNs with general transitions). 

As described in [T2], a problem of the algorithm is the fill-in of P and C, 
leading to a high space and time complexity. The reason is the following. The 
exponential state transitions when general transitions are enabled are described
by subordinated continuous-time Markov chains (CTMCs). P needs the condi- 
tional state probabilities of the subordinated CTMCs in the instant of firing 
and C needs the conditional expected sojourn times up to the firing of the gen- 
eral transitions. To obtain these quantities, a transient analysis and cumulative 
transient analysis of the subordinated CTMCs with respect to every possible 
initial state has to be performed. This leads to dense portions in P and C and 
therefore to high memory requirements and also to long execution times for its 
computation. The dense structure of P is also responsible for long execution 
times of the solution of the EMC. One improvement has been suggested in m- 
in a two-stage iterative method parts of the matrices are stored on disk, reducing 
main memory requirements and also reducing execution times in certain cases. 

Other attempts to improve the efficiency of the computation of the two ma- 
trices is to try to make use of the structure of the subordinated CTMCs. m 
suggest to use closed form solutions for special structures and to look for iso- 
morphisms between strongly connected components (SCCs) of the subordinated 
CTMCs. This leads often to a reduction of the execution time but can also add 
to it if no such structures are found. However, this approach eventually allows 
to compute the matrices faster, but it does not avoid the problem of fill-in with 
its bad implications on space and time complexity. In m a similar approach 
was suggested to be performed on the net level. 

In this paper a method is proposed in which storage of P and C is not
needed at all. As prerequisites we need insight into the structure of the P and
C matrices and an iterative method for computing the transient quantities of
the subordinated CTMCs. It is then possible to describe a two-stage iteration
scheme for the stationary analysis. The outer iteration computes the EMC
solution and the inner iteration performs the transient analysis and cumulative
transient analysis of the subordinated CTMCs (for the current EMC solution
vector only). As a consequence, space complexity is drastically reduced. For large
models the time complexity is reduced as well. This iterative solution algorithm
has already been described briefly in [3].

The paper is organized as follows. In the next section the conventional
analysis procedure based on an embedded Markov chain is reviewed. The iterative
procedure for computing the transient quantities of the subordinated CTMCs,
referred to as uniformization, is discussed in Sec. 3. Sec. 4 then presents
the iterative analysis algorithm. Complexity issues are
discussed in Sec. 5 and an example is given in Sec. 6.



2 The Conventional Analysis Procedure 

As a modeling formalism, SPNs are considered in which the general transitions 
are mutually exclusive. The preemption policy of all transitions is preemptive 
repeat different (prd) [21j . also referred to as race enabling [H. The iterative 
approach presented in the next sections can also be extended to the case of
preemption policy preemptive resume (also known as race age) and to
marking-dependent firing time distributions, as long as appropriate matrices P and C can
be found.

We adopt the notation of |8I4| . Let $G$ be the set of general transitions and
$g$ a single general transition. The firing time distribution of $g$ is $F^g(x)$, and
$\bar{F}^g(x)$, $f^g(x)$, and $x^g_{\max}$ denote its complementary distribution function, density
function, and maximum support, respectively. Let $S$ denote the finite set of states
of the MRP, $S^E$ the exponential states (no general transition enabled), $S^g$
the states in which $g$ is enabled, and $S^G = \bigcup_{g \in G} S^g$ the general states. After
a reachability analysis, the structure of the MRP is represented by the three
matrices $Q$, $\bar{Q}$, and $\Delta$ defined as:

- the non-diagonal entry $q_{ij}$, $i \ne j$, is the rate of the exponential state
transitions from $i$ to $j$ which do not preempt a general transition, the diagonal
entry $q_{ii}$ is the negative sum of all rates of exponential state transitions out
of state $i$ (including those which preempt a general transition),

- $\bar{q}_{ij}$ is the rate of the exponential state transitions from $i$ to $j$ which preempt
a general transition,

- $\delta_{ij}$ is the branching probability, the probability that the firing of a general
transition $g$ leads to state $j$, given that $g$ fires in $i$.

To be able to refer to portions of these matrices without worrying about the
ordering of the states we introduce the following filter concept. $I$ denotes the
identity matrix and the filter matrix $I^g$ is the same matrix where all rows which
do not correspond to states of $S^g$ are set to zero. A multiplication of $Q$ with $I^g$
from the left sets all rows to zero which do not correspond to states of $S^g$. As
a shorthand notation we use $Q^g = I^g Q$. The notation is used analogously for
the other matrices and subsets and also for vectors, e.g., $Q^E$, $\Delta^g$, $u^E$, etc.

The subordinated CTMC of $g$ is described by the generator matrix $Q^g$. The
generator matrix is defective in case of preemptions. The conditional state
probabilities of the subordinated CTMC in the instant when $g$ fires are collected in
the matrix $\Omega^g$. Similarly, the conditional expected sojourn times in the states
of the subordinated CTMC from the enabling to the firing of $g$ are collected in
the matrix $\Psi^g$:

$$\Omega^g = I^g \int_0^{x^g_{\max}} e^{Q^g x} f^g(x)\, dx ,$$

$$\Psi^g = I^g \int_0^{x^g_{\max}} e^{Q^g x} \bar{F}^g(x)\, dx$$

($I^g$ eliminates the ones at the diagonal entries of states not in $S^g$). Summation
over all general transitions yields:

$$\Omega = \sum_{g \in G} \Omega^g , \qquad \Psi = \sum_{g \in G} \Psi^g .$$

Based on $\Omega$ and $\Psi$, the stochastic matrix $P$ of the EMC is given as

$$P = I^E - \mathrm{diag}^{-1}(Q^E)\, Q^E + \Omega \Delta + \Psi \bar{Q} , \qquad (1)$$

as shown in |8I4| . The first two terms on the right side represent the EMC for the
exponential states embedded at the instant of firing, the third term represents
the firing of a general transition and the last one its preemption. The operator
$\mathrm{diag}^{-1}(\cdot)$ denotes the inverse of the diagonal matrix of the operand restricted
to non-zero diagonal elements. The conversion matrix $C$ is

$$C = -\mathrm{diag}^{-1}(Q^E) + \Psi . \qquad (2)$$

In terms of these definitions, the standard algorithm for the stationary
analysis is:

1. computation of $\Omega$ and $\Psi$,

2. computation of the stationary solution of the EMC by the solution of the
linear system $uP = u$ subject to the normalization condition $ue = 1$ ($e$ is a
vector of ones), and

3. conversion of the EMC solution by $c = uCe$, $v = \frac{1}{c} u$, $\pi = vC$ and $\varphi = v\Omega$
in order to get the state probabilities $\pi$ and firing frequencies $\varphi$. ($c$ represents
the mean time between the time instants at which the EMC is embedded,
$\varphi$ is the vector of firing rates of the general transitions in each state.)

The transient analysis in the first step can be performed by an extended version
of uniformization for each general state. Note that it has to be repeated for
each general state as an initial state. The second step is accomplished by an
arbitrary method for solving Markov chains. For models with large state spaces
an iterative method such as Gauss-Seidel or Successive Overrelaxation (SOR)
[20] should be used. As described in the introduction, the matrix exponentials
lead to dense matrices $\Omega$ and $\Psi$ and also to dense matrices $P$ and $C$. This causes
high memory requirements and also long execution times since the iteration in
step 2 is performed over a dense matrix.

3 Uniformization for General Distributions 

In the first step of the stationary solution algorithm presented in the last section
the integrals of the matrix exponentials have to be computed in order to get the
matrices $\Omega^g$ and $\Psi^g$ (in this section we omit the superscript $g$ for notational
convenience and write just $\Omega$ and $\Psi$). The integrals are taken with respect to
the density and the complementary distribution functions of the general transi- 
tions. Uniformization (also known as randomization) [2()IJ is known as an iterative 
method for the transient analysis at a deterministic time. 

A straightforward method for the computation of the required integrals would 
be to perform the transient analysis at successive discrete steps and to apply a 
quadrature rule such as the Simpson rule. The computation can for instance be 
performed by repeating uniformization for each step or by applying a Runge- 
Kutta method. This approach is however only recommended in stiff cases ($qt$ is
very large). Alternatively, it is possible to extend uniformization to the
computation of the integrals. As a result, an iterative algorithm is defined with which an
arbitrary accuracy can be achieved. The basic idea of extending uniformization
is due to Grassmann and algorithmic formulations were published in |7I2| .
An improved formulation is also available in the literature; this section highlights
the main results.

First we introduce the concept for a general function $g(x)$ and specialize it to
more concrete functions later. Let $g(x)$ be a real-valued function for which the
integral from zero to infinity is absolutely bounded: $\int_0^\infty g(x)\, dx = c$, $|c| < \infty$.
Define $q \ge \max_i |q_{ii}|$, the Poisson probabilities $\beta(k, qt) = e^{-qt} \frac{(qt)^k}{k!}$, and the
stochastic matrix $A = \frac{1}{q} Q + I$. Further, define the following values

$$\alpha_g(k, q) = \int_0^\infty \beta(k, qx)\, g(x)\, dx , \qquad k \in \mathbb{N} , \qquad (3)$$

referred to as $\alpha_g$-factors. Then, the following equation can formally be derived:

$$\int_0^\infty e^{Qx} g(x)\, dx = \int_0^\infty \sum_{k=0}^{\infty} A^k \beta(k, qx)\, g(x)\, dx = \sum_{k=0}^{\infty} A^k \alpha_g(k, q) .$$

Now consider the special case that the function is equal to a positive density
($g(x) = f(x)$) or to the complementary distribution function ($g(x) = \bar{F}(x)$). The
requested matrices $\Omega$ and $\Psi$ are then given by:

$$\Omega = \sum_{k=0}^{\infty} A^k \alpha_f(k, q) , \qquad \Psi = \sum_{k=0}^{\infty} A^k \alpha_{\bar{F}}(k, q) .$$

The factors are referred to as $\alpha_f$- and $\alpha_{\bar{F}}$-factors, respectively. It can be shown
that the $\alpha_g$-factors sum to $c$, are bounded by $c$, converge to zero for large $k$, and
that the $\alpha_{\bar{F}}$-factors can be computed from the corresponding $\alpha_f$-factors:

$$\alpha_{\bar{F}}(k, q) = \frac{1}{q} \left( 1 - \sum_{n=0}^{k} \alpha_f(n, q) \right) = \frac{1}{q} \sum_{n=k+1}^{\infty} \alpha_f(n, q) .$$

Truncation of the sum of $\Omega$ leads to: $\Omega \approx \hat{\Omega} = \sum_{k=L}^{R} A^k \alpha_f(k, q)$, for which
the error term $T = |\Omega - \hat{\Omega}|$ is bounded by:

$$T_{ij} \le 1 - \sum_{k=L}^{R} \alpha_f(k, q) .$$



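Applying left and right truncation to $\Psi$ yields:

$$\hat{\Psi} = \frac{1}{q} \sum_{k=0}^{L-1} A^k + \frac{1}{q} \sum_{k=L}^{R} A^k \left( 1 - \sum_{n=L}^{k} \alpha_f(n, q) \right) .$$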

The error term T = IfF — fFI is bounded by: 



i—L 



R 



1 



Ty < A - y]] - ( 1 - y]] a/(n, g) ) . 






'i—L 



( 4 ) 



( 5 ) 



( 6 ) 
$T_{ij}$ is bounded by:



O I -1 ^ t ^ 

r,j < ——'^af{n,q)+ ^ ap{k,q). (7) 

^ fc=0 k=R+l 

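To avoid fill-in, the vectors $u\Omega$ and $u\Psi$ can be computed by the scheme

$$u\Omega \approx \sum_{k=L}^{R} \Phi(k)\, \alpha_f(k, q) , \qquad u\Psi \approx \sum_{k=L}^{R} \Phi(k)\, \alpha_{\bar{F}}(k, q) ,$$

where $\Phi(0) = u$, $\Phi(k+1) = \Phi(k) A$.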
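The vector scheme can be illustrated on a two-state CTMC. The sketch below is an illustration under stated assumptions rather than the implementation used in the tools: it takes the deterministic firing time $\tau$ as the distribution, for which the $\alpha_f$-factors reduce to the Poisson probabilities $\beta(k, q\tau)$, so that the result can be compared with the known transient solution. The function names and the example generator are hypothetical.

```python
import math


def poisson(k, mean):
    """Poisson probability beta(k, mean) = exp(-mean) * mean**k / k!."""
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def vector_uniformization(u, Q, q, alpha, left, right):
    """Return sum_{k=left}^{right} Phi(k) * alpha(k), Phi(0) = u, Phi(k+1) = Phi(k) A."""
    n = len(u)
    # Uniformized stochastic matrix A = Q / q + I.
    A = [[Q[i][j] / q + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    phi = list(u)
    result = [0.0] * n
    for k in range(right + 1):
        if k >= left:
            weight = alpha(k)
            result = [r + weight * p for r, p in zip(result, phi)]
        # Only a vector-matrix product per step; no matrix power is ever stored.
        phi = [sum(phi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return result


if __name__ == "__main__":
    Q = [[-2.0, 2.0], [1.0, -1.0]]   # assumed example generator
    q = 2.0                          # q >= max |q_ii|
    tau = 0.5                        # assumed deterministic firing time
    u = [1.0, 0.0]
    print(vector_uniformization(u, Q, q, lambda k: poisson(k, q * tau), 0, 40))
```

Because only the vector $\Phi(k)$ is propagated, the memory needed is that of the sparse generator plus a few vectors, which is the point of the scheme.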

The tails occurring in these error bounds can be bounded by appropriate
Poisson tails. The right $\alpha_f$-tail is bounded by the right Poisson tail in $x_{\max}$:

$$\sum_{k=R+1}^{\infty} \alpha_f(k, q) \le \sum_{k=R+1}^{\infty} \beta(k, q x_{\max}) .$$

The right $\alpha_{\bar{F}}$-tail is bounded by the right Poisson tail in $x_{\max}$ multiplied
with the expected firing time $\bar{X}$:

$$\sum_{k=R+1}^{\infty} \alpha_{\bar{F}}(k, q) \le \bar{X} \sum_{k=R+1}^{\infty} \beta(k, q x_{\max}) .$$

As a result, truncation points satisfying a prespecified error tolerance $\epsilon$ for the
$\alpha_f$- and $\alpha_{\bar{F}}$-factors can be determined without the actual computation of the
factors.

We also conclude that the computational complexity for uniformization with
respect to a random time is the same as for uniformization with respect to a
deterministic time, when the deterministic time $t$ is replaced by the maximum
support $x_{\max}$ of the distribution. Let $N$ be the dimension of $Q$ and $\eta$ its number
of non-zero entries. For the computation of the vectors $u\Omega$ and $u\Psi$ the space
complexity is of order $O(\eta)$ and the time complexity of order $O(\eta q x_{\max})$. For
the computation of the full matrices $\Omega$ and $\Psi$ everything has to be performed
for each possible initial state. Due to fill-in the space complexity is $O(N^2)$ and
time complexity is $O(N \eta q x_{\max})$.

The remaining question is how the a-factors actually can be computed. In 
m iterative formulas were presented for the a-factors of expolynomial distri- 
butions (distributions which can be composed piecewise by polynomials and ex- 
ponential polynomials). A refinement of these iterations assuring that no under- 
and overflow can happen is given in [Q. Other distributions can also be dealt 
with if the a-factors are known. One example is the Pareto distribution. 

In the following the $\alpha$-factors are given for an example. We consider a
triangular distribution $f(x) = x\, R(0, 1] + (2 - x)\, R(1, 2]$ from zero to 2. The
complementary distribution is given by $\bar{F}(x) = (1 - \frac{x^2}{2})\, R(0, 1] + (2 - 2x + \frac{x^2}{2})\, R(1, 2]$;
the density and the complementary distribution are plotted in Figure 1.

Fig. 1. $f(x)$ and $\bar{F}(x)$ for triangular distribution from 0 to 2



Figure 2 shows the $\alpha_f$- and $\alpha_{\bar{F}}$-factors. $q$ is varied from 10 to 1000 and the
precision is set to e = 10“^. The values of the right truncation points are: R = 41 
for $q = 10$, $R = 255$ for $q = 100$, and $R = 2144$ for $q = 1000$.
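For this triangular example the factors can also be approximated by direct numerical integration of definition (3). The sketch below is not the iterative expolynomial scheme referred to in the text; it uses composite Simpson quadrature as a stand-in, and derives the $\alpha_{\bar{F}}$-factors from the $\alpha_f$-factors via the tail-sum relation. Function names, the number of quadrature intervals and the cut-off `k_max` are assumptions.

```python
import math


def poisson(k, mean):
    """Poisson probability beta(k, mean) = exp(-mean) * mean**k / k!."""
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def tri_density(x):
    """Triangular density on (0, 2] with its peak at x = 1."""
    if 0.0 < x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0


def simpson(func, a, b, intervals):
    """Composite Simpson rule; `intervals` must be even."""
    h = (b - a) / intervals
    total = func(a) + func(b)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


def alpha_f(k, q, intervals=400):
    """alpha_f(k, q) by quadrature, integrating each linear piece separately."""
    def integrand(x):
        return poisson(k, q * x) * tri_density(x)
    return (simpson(integrand, 0.0, 1.0, intervals)
            + simpson(integrand, 1.0, 2.0, intervals))


def alpha_factors(q, k_max):
    """Return the lists (alpha_f, alpha_Fbar) for k = 0..k_max."""
    af = [alpha_f(k, q) for k in range(k_max + 1)]
    a_fbar = [0.0] * (k_max + 1)
    tail = 0.0  # sum of alpha_f(n) for n > k; mass beyond k_max is neglected
    for k in range(k_max, -1, -1):
        a_fbar[k] = tail / q
        tail += af[k]
    return af, a_fbar


if __name__ == "__main__":
    af, a_fbar = alpha_factors(q=10.0, k_max=80)
    print("sum of alpha_f factors:", sum(af))
    print("sum of alpha_Fbar factors:", sum(a_fbar))
```

The two sums should be close to 1: the $\alpha_f$-factors sum to the total probability mass, and the $\alpha_{\bar{F}}$-factors sum to the mean firing time, which is 1 for this distribution.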




Fig. 2. $\alpha_f(k, q)$ and $\alpha_{\bar{F}}(k, q)$ versus $k$ ($q$ = 10, 100, 1000, triang. from 0 to 2)



4 An Iterative Algorithm for Fill-in Avoidance 

A purely iterative solution method for MRPs can be organized in two stages. In 
outer iterations the EMC is solved. The stochastic matrix P of the EMC needs 
not to be computed explicitly, since its structure is known in matrix terms. In 
inner iterations uniformization is applied to the current solution vector of the 
EMC. 

In the outer iteration the stationary solution u of the EMC is computed by 
the power method m 

$$u(m + 1) = u(m)\, P . \qquad (8)$$

$u(0)$ is an appropriate starting vector, e.g., the initial distribution according
to the initial marking, the discrete uniform distribution over all markings, or a
vector with normalized entries obtained from a random number generator. [20],
pp. 155-156, describes advantages and disadvantages of such starting vectors for
iterative methods. The iteration can be performed without having the stochastic
matrix $P$ explicitly available. Insertion of the known structure of $P$ into the
iteration formula of the power method (8) yields

$$\begin{aligned}
u(m + 1) &= u(m)\, P \\
&= u(m) \left( I^E - \mathrm{diag}^{-1}(Q^E)\, Q^E + \Omega \Delta + \Psi \bar{Q} \right) \\
&= u^E(m) - u^E(m)\, \mathrm{diag}^{-1}(Q^E)\, Q^E + u^G(m)\, \Omega \Delta + u^G(m)\, \Psi \bar{Q} .
\end{aligned}$$

Defining vectors $a(m)$ and $b(m)$ as

$$a(m) = u^G(m)\, \Omega = \sum_{g \in G} u^g(m)\, \Omega^g , \qquad b(m) = u^G(m)\, \Psi = \sum_{g \in G} u^g(m)\, \Psi^g \qquad (9)$$

leads then to the iteration formula

$$u(m + 1) = u^E(m) - u^E(m)\, \mathrm{diag}^{-1}(Q^E)\, Q^E + a(m)\, \Delta + b(m)\, \bar{Q} . \qquad (10)$$

Each application of Formula (10) requires the computation of the vectors $a(m)$
and $b(m)$. This can be done iteratively by uniformization. Note that here
uniformization has only to be applied to the current solution vector $u(m)$ and not
to a matrix, as opposed to the conventional version of the solution algorithm.

Given that the embedded DTMC is irreducible (or contains at most one
recurrent class) and aperiodic, the power method is guaranteed to converge.
Testing of convergence can be based on the maximum difference of successive
iterates. See [20], pp. 156-159, for a detailed discussion of such convergence tests.
Assume that a good estimate of the stationary solution is obtained after the $M$th
iteration, i.e., $u \approx u(M)$. The normalization constant $c$ is computed as

$$\begin{aligned}
c = uCe &\approx -u^E(M)\, \mathrm{diag}^{-1}(Q^E)\, e + u^G(M)\, \Psi e \\
&= -u^E(M)\, \mathrm{diag}^{-1}(Q^E)\, e + b(M)\, e , \qquad (11)
\end{aligned}$$

such that $v = \frac{1}{c} u$. The state probabilities and firing frequencies are then obtained
by

$$\pi = vC \approx -\frac{1}{c}\, u^E(M)\, \mathrm{diag}^{-1}(Q^E) + \frac{1}{c}\, b(M) , \qquad (12)$$

$$\varphi = v\Omega \approx \frac{1}{c}\, a(M) . \qquad (13)$$

The following is a summary of the iterative algorithm: 

1. computation of the a-factors for all general distributions, 

2. outer iteration over $u(m)$ according to Equation (10), the vectors $a(m)$ and
$b(m)$ are computed in inner iterations by uniformization,

3. normalization of $u$ by $c$ according to Equation (11), and

4. backsubstitution of $v$ according to Equations (12) and (13).
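The outer iteration of step 2 can be sketched in matrix-free form: the power method only needs a routine that maps $u(m)$ to $u(m+1)$, which in the algorithm above would evaluate Equation (10) with inner uniformization. In the sketch below that routine is replaced by multiplication with a small explicit stochastic matrix, purely as a stand-in; the names `power_method` and `step`, the tolerance and the example matrix are assumptions.

```python
def power_method(step, u0, tol=1e-12, max_iter=10000):
    """Iterate u <- step(u) until the max difference of successive iterates < tol."""
    u = list(u0)
    for m in range(1, max_iter + 1):
        nxt = step(u)
        diff = max(abs(a - b) for a, b in zip(nxt, u))
        u = nxt
        if diff < tol:
            return u, m
    raise RuntimeError("power method did not converge")


def make_matrix_step(P):
    """Stand-in for Equation (10): returns a function computing the product u * P."""
    n = len(P)

    def step(u):
        return [sum(u[i] * P[i][j] for i in range(n)) for j in range(n)]

    return step


if __name__ == "__main__":
    P = [[0.5, 0.5], [0.2, 0.8]]  # assumed irreducible, aperiodic example
    u, iterations = power_method(make_matrix_step(P), [0.5, 0.5])
    print(u, iterations)
```

Passing a function instead of a matrix keeps the outer loop unchanged when the stand-in is replaced by a routine that never forms $P$.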

This algorithm has also been implemented in SPNica [1| and its principal
behavior has been investigated. Currently an implementation based on sparse data
structures is under development and will be included into TimeNET [B|. To lower
the number of inner iterations, the accuracy of uniformization can
be low in the first outer iteration steps. Steady-state detection similar to HB|
can also help to reduce the overall number of iterations. The inner iterations
are also well suited for a parallelization: the exponential parts and the SCCs of 
the subordinated CTMCs can be distributed. The inner iterations can then be 
performed concurrently for each part. 
5 Complexity Analysis

A comparison of the asymptotic complexity of the two solution algorithms is
given. We assume that both algorithms are implemented by using sparse data
structures in which only the non-zero entries are stored and used in the
operations. Let $\eta^E$ be the number of non-zero entries of $Q^E$, $\eta^g$ the number of
non-zero entries of $Q^g$ of general transition $g$, and $R^g$ the right truncation
point in the computation of the $\alpha$-factors for transition $g$. As discussed in Sec.
3, uniformization applied to a vector has space complexity $O(\eta^g)$ and time
complexity $O(\eta^g R^g)$, where $O(R^g) = O(q^g x^g_{\max})$. $M^{SOR}$ and $M^{Power}$ are the numbers
of iterations required for the SOR and power method, respectively.

The space complexity of the conventional algorithm is $O(\eta^E + \sum_{g \in G} |S^g|^2)$
(storage of $Q^E$ and of the dense portions due to the integration of the
matrix exponentials) and the time complexity is
$O(\sum_{g \in G} |S^g| \eta^g R^g + M^{SOR} (\eta^E + \sum_{g \in G} |S^g|^2))$
(for each general transition uniformization is repeated for each
state of $S^g$ and iterations are performed over the dense matrix $P$).

In the iterative solution algorithm the space complexity is $O(\eta^E + \sum_{g \in G} \eta^g)$
(only sparse matrices and vectors have to be stored) and the time complexity is
$O(M^{Power} (\eta^E + \sum_{g \in G} \eta^g R^g))$ ($M^{Power}$ repetitions of uniformization).

For a comparison of both algorithms some simplifications are introduced.
The distinction of the different general transitions is dropped and it is assumed
that the order of the general states is of the order of the state space: $O(|S^G|) =
O(|S|)$ with $N = |S|$, $\eta$ the number of non-zero entries, and $R$ the right truncation point
of uniformization. Furthermore we assume $O(M^{SOR}) = O(M^{Power})$ and omit the
superscript.

In the standard algorithm, the space complexity is then roughly $O(N^2)$ and
the time complexity is $O(N \eta R + M N^2)$. In the iterative algorithm, the space
complexity is $O(\eta)$ and the time complexity is $O(M \eta R)$.

As one can see, the space complexity of the iterative approach is always better. 
The time complexity of the iterative approach is dominated by the product 
RM of the number of inner and outer iterations. In contrast, R and M 
appear in separate additive terms in the expression for the standard approach. 
The iterative approach is therefore computationally intensive if both iteration 
counts are large. The single term $M\eta R$ is smaller than the first term in 
$N\eta R + N^2 M$ if $M < N$ and smaller than the second term if $R < N^2/\eta$. 
Both conditions are likely to be satisfied in large models. 
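
The comparison of the two simplified cost expressions can be made concrete with a small sketch. The function names and the sample values of N, eta, R and M below are hypothetical and only illustrate the two conditions stated above; constants hidden by the O-notation are ignored.

```python
def standard_time(N, eta, R, M):
    """Simplified cost N*eta*R + M*N^2 of the standard algorithm."""
    return N * eta * R + M * N**2


def iterative_time(N, eta, R, M):
    """Simplified cost M*eta*R of the iterative algorithm (N does not appear)."""
    return M * eta * R


def both_conditions_hold(N, eta, R, M):
    """M < N (first term) and R < N^2 / eta (second term)."""
    return M < N and R < N**2 / eta


# Hypothetical large model: 1e5 states, 5e5 non-zero entries,
# truncation point 20, 60 outer iterations.
N, eta, R, M = 100_000, 500_000, 20, 60
print(both_conditions_hold(N, eta, R, M))
print(standard_time(N, eta, R, M), iterative_time(N, eta, R, M))
```

When both conditions hold, the single iterative term is below each of the two terms of the standard expression, so it is also below their sum.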

6 Examples 

Figure 3 shows an SPN model adapted from [TD] which models a mechanism for 
the management of packet switching in connection-oriented networks, referred to 
as on-demand connection with delayed release (OCDR). The transitions arrival 
and service model the arrival and service of packets; connect and release model 
the setup and release of a connection. The model can be used to study the 
tradeoff between the mean waiting time and the bandwidth utilization for different values 
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Fig. 3. SPN model of the OCDR system 



of the connection release timer. According to Little's law, the mean waiting time 
is given by $W = E[\#buffer + \#busy] / E[\#service]$. The utilization is the ratio 
of used and requested bandwidth, $U = E[1_{\{\#notconnected = 0\}}]$. 

For illustration purposes we assume that the packets are generated according 
to a Poisson process with a rate of $\lambda = 100$ per second and that their length 
is uniformly distributed from 100 to 2000 bytes. The bit rate of the connection 
is 10 Mbps, the time for setup is 10 ms and for release is 20 ms. Taking one 
second as the underlying time unit, the firing time of service is uniformly 
distributed from 0.00008 to 0.0016 and the firing times of connect and release 
are deterministically equal to 0.01 and 0.02. In order to be able to show all data 
structures explicitly, we assume a buffer space K = 2. General transition release 
has prd-policy and is preempted when the exponential transition arrival fires. 
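
The firing time bounds of service follow directly from the packet lengths and the bit rate. The following sketch reproduces this arithmetic; it assumes that 10 Mbps means 10^7 bits per second and that a byte has 8 bits, and the function name is hypothetical.

```python
BIT_RATE = 10_000_000  # assumed: 10 Mbps = 1e7 bits per second


def transmission_time(n_bytes, bit_rate=BIT_RATE):
    """Time in seconds needed to transmit a packet of n_bytes bytes."""
    return 8 * n_bytes / bit_rate


lower = transmission_time(100)    # shortest packet
upper = transmission_time(2000)   # longest packet
mean = (lower + upper) / 2        # mean of the uniform firing time
print(lower, upper, mean)
```

The resulting bounds are the values 0.00008 and 0.0016 used for the uniform firing time distribution of service.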

For notational convenience, the three general transitions service, connect, 
and release are in the following abbreviated by their first letters s, c, and r, 
respectively. The general firing time distributions are denoted as $F^s(x)$, 
$F^c(x)$, and $F^r(x)$. The set of general transitions is given by $G = \{s, c, r\}$. 
The marking process $N(t)$ is defined by 
$\#buffer + \#busy + 1_{\{\#notconnected=0\}}(K + 1)$. For 
$K = 2$ the state space and its subsets are given by $S = \{0, 1, 2, 3, 4, 5\}$, 
$S^E = \{0\}$, $S^s = \{4, 5\}$, $S^c = \{1, 2\}$, $S^r = \{3\}$, and 
$S^G = \{1, 2, 3, 4, 5\}$. The state-transition-rate diagram of the marking process 
is shown in Figure 4. The preemptive state transition is shown as a thick arc. 
States with an active general transition are supplemented by the age variable x; 
general state transitions are labeled with the instantaneous rate, which depends 
on the value of x at the origin of the arc. 

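The three matrices $Q$, $\bar{Q}$, and $\Delta$ are given by 

$$
Q = \begin{pmatrix}
-\lambda & \lambda & 0 & 0 & 0 & 0 \\
0 & -\lambda & \lambda & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\lambda & 0 & 0 \\
0 & 0 & 0 & 0 & -\lambda & \lambda \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},
\quad
\bar{Q} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \lambda & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},
\quad
\Delta = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
$$

$\bar{Q}$ represents the preemption of the general transition release. All entries of the 
first row of $\Delta$ are equal to zero since it corresponds to an exponential state. It 
is obvious that the three matrices are sparse. 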
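
To illustrate the sparsity, the three matrices for K = 2 can be stored as dictionaries that hold only the non-zero entries. This is a minimal sketch, not the data structure used by SPNica; the names `Q_BAR`, `DELTA`, `combined_row_sums` and `nonzeros` are hypothetical.

```python
LAM = 100.0   # arrival rate per second
N = 6         # number of states for K = 2

# Only non-zero entries are stored, keyed by (row, column).
Q = {(0, 0): -LAM, (0, 1): LAM,
     (1, 1): -LAM, (1, 2): LAM,
     (3, 3): -LAM,
     (4, 4): -LAM, (4, 5): LAM}
Q_BAR = {(3, 4): LAM}                        # preemption of release
DELTA = {(1, 4): 1.0, (2, 5): 1.0,           # firing of connect
         (3, 0): 1.0,                        # firing of release
         (4, 3): 1.0, (5, 4): 1.0}           # firing of service


def combined_row_sums(*matrices):
    """Row sums of the sum of several sparse matrices."""
    sums = [0.0] * N
    for matrix in matrices:
        for (row, _col), value in matrix.items():
            sums[row] += value
    return sums


def nonzeros(*matrices):
    """Total number of stored (non-zero) entries."""
    return sum(len(matrix) for matrix in matrices)


print(combined_row_sums(Q, Q_BAR))   # exponential rates balance in every row
print(nonzeros(Q, Q_BAR, DELTA), "of", 3 * N * N, "entries are non-zero")
```

The row sums of the two rate matrices taken together vanish, which is a quick consistency check on the entries: the rate leaving state 3 by preemption is accounted for in the diagonal of the first matrix.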
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Fig. 4. Stochastic process of the OCDR model 



The maximum outgoing rate of the subordinated CTMC is in all cases q = 
100. The product of q and the maximum support of the firing time distributions 
is an indicator of the magnitude of the right truncation point R for computing the 
α-factors. In case of service, 100 · 0.0016 = 0.16, in case of connect, 100 · 0.01 = 
1, and in case of release, 100 · 0.02 = 2. In all cases the product is small and 
we do not expect that the number of required factors and thus the number of 
iterations is large. (In the terminology of queuing systems, the light traffic case leads 
to a small number of iterations and the heavy traffic case to a large number 
of iterations.) In fact, when the precision is set to $\varepsilon = 10^{-7}$, we get R = 5, 
R = 10, and R = 13, respectively. Table 1 gives the results obtained by SPNica 
for all factors. (The α-factors for the deterministic transitions simplify to the 
Poisson probabilities and their integrals.) Columns 2 and 3 give the α-factors for 
service, the others the α-factors for the two deterministic transitions. The last 
row gives the actual truncation error. As expected, it is always smaller than ε. 
Also, the error obtained for the $\alpha_\Psi$-factors is always smaller than the error for 
the $\alpha_\Omega$-factors in this example (this is not always but very often the case). 
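
For the two deterministic transitions the factors can be recomputed by hand, since they reduce to Poisson probabilities and their integrals. The sketch below assumes that the truncation point is the smallest R for which the neglected Poisson tail mass falls below the precision, and that the integrated factor equals the probability of more than k Poisson events divided by q. Both are assumptions of this illustration, not a description of the SPNica code, and all function names are hypothetical.

```python
import math


def poisson(k, mean):
    """Poisson probability of k events with the given mean q * tau."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def truncation_point(mean, eps):
    """Smallest R such that the Poisson mass beyond R is below eps."""
    k = 0
    cumulative = poisson(0, mean)
    while 1.0 - cumulative >= eps:
        k += 1
        cumulative += poisson(k, mean)
    return k


def integrated_factor(k, q, tau):
    """Integral of the k-th Poisson probability over [0, tau]."""
    mean = q * tau
    return (1.0 - sum(poisson(i, mean) for i in range(k + 1))) / q


q = 100.0
for name, tau in (("connect", 0.01), ("release", 0.02)):
    R = truncation_point(q * tau, 1e-7)
    print(name, R, poisson(0, q * tau), integrated_factor(0, q, tau))
```

Under these assumptions the sketch yields R = 10 for connect and R = 13 for release, in agreement with the values reported above.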

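The stochastic matrix $P$ of the EMC and the conversion matrix $C$ are 

$$
P = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.3678 & 0.6321 \\
0 & 0 & 0 & 0 & 0 & 1 \\
0.1353 & 0 & 0 & 0 & 0.8647 & 0 \\
0 & 0 & 0 & 0.9203 & 0.07968 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
$$

$$
C = \begin{pmatrix}
0.01 & 0 & 0 & 0 & 0 & 0 \\
0 & 0.006321 & 0.003679 & 0 & 0 & 0 \\
0 & 0 & 0.01 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.008647 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.0007968 & 0.00004317 \\
0 & 0 & 0 & 0 & 0 & 0.00084
\end{pmatrix}
$$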

The state-transition-rate diagram of the EMC is shown in Figure 5. It is easy 
to derive that in general the number of states is given by $2K + 2$ (six states 
for K = 2), the number of state transitions in the reduced reachability graph 
(RRG, corresponding to the stochastic process shown in Figure 4) is equal to 
$4K + 1$ and the number of state transitions of the EMC is equal to 
$K^2 + K + 3$. Figure 6 shows the 
number of transitions in the RRG and EMC versus the state space size and thus 
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Table 1. a-factors for the OCDR model 



k      α_Ω^s(k,q)  α_Ψ^s(k,q)  α_Ω^c(k,q)  α_Ψ^c(k,q)  α_Ω^r(k,q)  α_Ψ^r(k,q)
0      0.920       0.000796    0.368       0.00632     0.135       0.00865
1      0.0755      4.15e-5     0.368       0.00264     0.271       0.00594
2      0.00398     1.63e-6     0.184       0.000803    0.271       0.00323
3      0.000158    5.17e-8     0.0613      0.000190    0.180       0.00143
4      5.03e-6     1.37e-9     0.0153      3.66e-5     0.0902      0.000526
5      1.34e-7     3.11e-11    0.00306     5.94e-6     0.0361      0.000166
6      -           -           0.000511    8.32e-7     0.0120      4.53e-5
7      -           -           7.30e-5     1.02e-7     0.00344     1.10e-5
8      -           -           9.12e-6     1.12e-8     0.000859    2.37e-6
9      -           -           1.01e-6     1.11e-9     0.000191    4.65e-7
10     -           -           1.01e-7     1.00e-10    3.82e-5     8.31e-8
11     -           -           -           -           6.94e-6     1.36e-8
12     -           -           -           -           1.16e-6     2.07e-9
13     -           -           -           -           1.78e-7     2.93e-10
error  3.11e-9     6.30e-13    1.00e-8     9.00e-12    2.93e-8     4.41e-11



demonstrates that P gets dense. For 10® states the EMC has more than 10® 
state transitions, a formidable number for storage and iteration! 
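
The growth of the EMC can be illustrated by counting transitions as a function of the buffer size K. The counting rule in the sketch below is an assumption derived from the structure of P for K = 2 (each connect or service state can reach every state that further arrivals may lead to before the general transition fires); the function names are hypothetical.

```python
def num_states(K):
    """K + 1 states without and K + 1 states with a connection."""
    return 2 * K + 2


def rrg_transitions(K):
    """Number of arcs in the reduced reachability graph."""
    return 4 * K + 1


def emc_transitions(K):
    """Count the non-zero entries of P state by state."""
    count = 1                                           # state 0: first arrival
    count += sum(K - i + 1 for i in range(1, K + 1))    # connect states
    count += 2                                          # release: fires or is preempted
    count += sum(K - i + 1 for i in range(1, K + 1))    # service states
    return count


for K in (2, 10, 100, 1000):
    print(num_states(K), rrg_transitions(K), emc_transitions(K))
```

The RRG grows linearly in the number of states while the EMC grows quadratically, which is the density effect described above.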

Figure 7 shows the convergence of the two measures W and U. The 
numerical values have been computed with the SPNica implementation of the 
iterative algorithm. The utilization U and mean waiting time W are plotted 
versus the number of iterations. As a starting vector, the discrete uniform dis- 
tribution was chosen and for convergence the maximum absolute difference of 
successive iterates was forced to be below 10“^. In the example, 62 iterations 
were required. The example also suggests that the order of magnitude of the 
measures is found by just a few iterations, say after 10 iterations. 



Fig. 5. EMC of the OCDR model 

Fig. 6. Transitions in RRG and EMC vs. state space 

Fig. 7. U and W vs. number of iterations 



To conclude this section we report on an empirical evaluation of the costs 
of the two algorithms. The iterative algorithm was compared with the standard 
algorithm as it is used in TimeNET. The experiments were conducted for the 
example SPN model shown in Figure 8, which is taken from [T7j and was also 
investigated in m- It is an artificial example where only transition Det 1 is deter- 
ministic and all others are exponential. The mean firing times of all transitions 
are set equal to one. All experiments were performed on a Sun Sparc Ultra 5 
workstation with 270 MHz and 128 MB main memory. In the standard algorithm 
SOR was used for the solution of the linear system and in both algorithms a 
precision of 10“^ was employed. Figure [9] shows the memory requirements and 
execution time of both algorithm variants versus the state space size when the 
value of N is varied. As expected, the iterative algorithm needs less memory. 
For small state spaces it is slower than the standard algorithm, but this relation 
inverts for larger state spaces. It should, however, be noted that for other models 
the results are not always as clear as in this example. 



7 Conclusions 

It is clear that the memory requirements of the new iterative method are much 
better than those of the conventional method. The space complexity is thus 
reduced to that which one normally is willing to accept for Markovian models 
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Fig. 8. Molloy’s example SPN model 
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Fig. 9. Memory requirement and Execution time vs. state space size 



like GSPNs. The open question is about time complexity. More experimental 
studies of that issue are required. 

A main problem of the iterative solution algorithm is perhaps that the outer 
iterations are based on the power method, which converges more slowly than 
other iterative methods such as Gauss-Seidel and its variants. Since the entries 
of P are not directly accessible, it is at the moment not known how Gauss-Seidel 
could be used in this context. As another possibility of acceleration we 
are currently implementing a version of the iterative solution algorithm in which 
the inner iterations for each SCC of the subordinated CTMCs are performed in 
parallel. 
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Abstract. In order to extend their applicability to more complex situ- 
ations, in this paper we present a new approach for the analysis of non- 
Markovian Stochastic Petri Net (NMSPN) models, which is based on a 
discrete time approximation of the stochastic behavior of the marking 
process. The proposed approach, which resulted in a new modeling tool 
for the analysis of NMSPNs called WebSPN, allows the analysis of a wider 
class of PN models with concurrently enabled generally distributed transitions 
of prd, prs and pri type. This implies the possibility of dealing with very 
complex systems with arbitrarily distributed events with very complex 
interrelations among each other. The adopted technique is described, an 
application example is solved and the results are carefully analyzed in 
order to demonstrate the validity of the proposed approach. 



1 Introduction 

Petri nets are commonly viewed as a valid tool for the qualitative and quantita- 
tive study of computer systems [2]. Over the years, many stochastic extensions 
to the basic Petri net model have been proposed. Dealing with non-exponentially 
distributed events is an extension that widened the field of applicability of this 
modeling approach to real situations. There are a great number of real circum- 
stances in which deterministic or generally distributed event times occur. Choi 
et al. have shown that the marking process underlying a Stochastic Petri Net 
(SPN), where at most one generally distributed transition is enabled in each 
marking, belongs to the class of Markov Regenerative Processes (MRGPs) [7]. 
Following the line opened in [7], different approaches have been proposed to deal 
with non-Markovian systems [8,14,16]. 

All the above literature on Markov Regenerative SPNs (MRSPNs) implicitly 
assumed an enabling memory policy (as defined in [1]). The transient analysis 
of a class of NMSPNs with age memory policy ([1]) was provided in [5], and 
a preemption mechanism, different from the ones considered in [1], was introduced 
and analyzed in [4]. Following the common terminology used when dealing 
with queuing systems, the stochastic behavior of the transitions of an SPN model 
has been classified as preemptive repeat different (prd), preemptive resume (prs) 
and preemptive repeat identical (pri), respectively. These stochastic extensions 
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have increased the descriptive power of SPNs, as well as the computational ef- 
fort required for their solution. Many SPN modeling tools have recently been 
proposed or developed (e.g. ESP [11], GreatSPN [6], SPNP [9], DSPNExpress 
[16], TimeNet [13], UltraSAN [10]). Some of the above tools have also imple- 
mented the possibility of including some non-Markovian features thus extending 
the range of applicability of PNs. Their main limitations concern the kind and 
number of generally distributed firing time (GEN) transitions and their associated 
preemption policy. Only a very limited number of simultaneously enabled GEN 
transitions is allowed, usually just one. Further, the preemptive 
repeat different (prd) policy is the only one adopted. The preemptive resume (prs) and 
the recently proposed preemptive repeat identical (pri) policies [4], although very 
powerful, are basically not yet implemented. The first restriction can be relaxed 
by the analytical results available for the analysis of PNs with non-overlapping 
prs general transitions [5], and there is active research on finding a proper way 
to analyze PNs with concurrently active general transitions [16,17]. 

A possible approach for the analysis of SPN models with concurrently active 
prs and prd general transitions is the continuous time phase type (CPH) 
approximation of generally distributed firing times [11]. With this technique, the 
marking process of the NMSPN is approximated by a continuous time Markov 
chain with an expanded state space [11]. 

In this paper, we discuss a modeling technique for the analysis of NMSPNs 
that relaxes some of the restrictions present in currently available SPN analy- 
sis packages. This approach is based on a discrete time approximation of the 
stochastic behavior of the marking process, hence it can be considered as a dis- 
crete time version of the phase type expansion technique. A similar approach can 
be found in [9], where Discrete Deterministic and Stochastic PNs (DDSPNs) are 
presented and race policies equivalent to our prd and prs policies are considered. 
The main differences with our approach consist in the intrinsic assumption of a 
discrete time model, the lack of the pri policy and the absence of a full imple- 
mentation of the proposed algorithm. A new modeling tool for the analysis of 
non-Markovian stochastic Petri nets, called WebSPN [15] has been successfully 
implemented. The approach we propose offers some new features which result in 
the possibility to analyze a wider class of NMSPN models. The main advantages 
of this method consist of the possibility to evaluate SPNs with transitions of 
pri type with finite firing time, besides the more traditional prd and prs, and 
to analyze models with concurrently enabled generally distributed transitions of 
any kind. 



2 Introducing Petri Nets and Preemption Policies 



A timed Petri net is a tuple $PN = (\mathcal{P}, T, \Gamma, A, \mathcal{I}, \mathcal{O}, \mathcal{H}, M_0)$ where: $\mathcal{P}$ is the set of 
places; $T$ is the set of transitions; $\Gamma$ is the set of random variables $\gamma_g$ associated 
with the transitions; $A$ is the set of age variables $a_g$ associated with the transitions; 
$\mathcal{I}$, $\mathcal{O}$ and $\mathcal{H}$ are respectively the sets of input, output and inhibitor functions 
($\mathcal{I} \subset \mathcal{P} \times T$, $\mathcal{O} \subset T \times \mathcal{P}$, $\mathcal{H} \subset \mathcal{P} \times T$), providing their multiplicity; $M_0$ is the 
initial marking^1. 

The firing of an enabled transition $t_k$, in a given marking $M_i$, generates 
another marking $M_j$. $M_j$ is said to be directly reachable from $M_i$ ($M_i \xrightarrow{t_k} M_j$). 

Starting from the initial marking $M_0$, the transitive closure of $\rightarrow$ generates the 
reachability graph $RG(M_0)$ (the set of all markings reachable from $M_0$). 

A consistent way to introduce memory into an SPN is provided in [1] and 
extended in [5]. Each timed transition $t_g$ is assigned a general random firing 
time $\gamma_g$ with a cumulative distribution function $G_g(t)$. A clock, associated with 
each transition, counts the time during which the transition has been enabled. An 
age variable $a_g$ associated with the timed transition $t_g$ keeps track of the clock 
count. A timed transition fires as soon as the memory variable $a_g$ reaches the 
value of the firing time $\gamma_g$. 

A timed transition has to be characterized both in terms of the distribution 
function of the random firing time and also of its behavior when a preemption 
occurs. Thus, a preemption policy is required to fully describe the behavior of a 
timed transition. In this paper we prefer an informal approach to the definition 
of preemption policies through the example of Figure 1 b). 

The Petri net in Figure 1 b) models a server with exponential arrivals (transition 
$t_1$) and general service time (transition $t_2$). Waiting customers are represented 
by the tokens in place $P_1$. The server is randomly preempted by higher 
priority jobs (transition $t_3$) for an exponentially distributed amount of time 
(transition $t_4$), as shown by the inhibitor arc from place $P_3$ to transition $t_2$. 

When a customer arrives at the server, a specific service requirement $\gamma_g$ has 
to be completed. The amount of computation required is sampled from the 
distribution function $F_g(t)$ of the service time. The optimal case is when the 
server is able to complete the job before an interruption occurs. However, the 
server may be interrupted after processing only a portion of the submitted job. 
In this case the overall behavior is strongly affected by the preemption policy, and 
the overall performance depends on the strategy adopted to deal with the 
preempted job, as described in the following: 

— The server drops the customer it was dealing with before the interruption; 
the next service starts with a newly sampled work requirement. 

— The server goes back to the preempted customer, who still maintains the 
original work requirement $\gamma_g$, and resumes the service from the point of 
interruption. 

— The server also returns to the same customer, who still has the same work 
requirement $\gamma_g$, but the work done before the interruption is lost and the 
service is repeated from the beginning. 

According to [5] and [4], the previous policies are referred to as preemptive 
repeat different (prd), preemptive resume (prs) and preemptive repeat identical 
(pri), respectively. Note that in [1] the authors referred to the prd and prs type 
policies as enabling and age type. The pri policy was introduced for the first 
time in [4]. The prd policy is the only one considered in the available tools modeling 
non-Markovian SPNs [16,13,10]. The ESP tool [11] supports the prs policy 

^1 A marking $M_i$ is a tuple, whose cardinality is $\|\mathcal{P}\|$, recording the number of tokens 
in each place. 
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through a continuous time PH approximation. Recently German developed a tool 
based on a Mathematica package, in which the prs policy is also implemented by 
adopting the method of supplementary variables [12]. 
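
The effect of the three preemption policies on the memory of a transition can be sketched as follows. This is only an illustration of the informal description given above; the `Memory` class, the function names and the numerical values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Memory:
    age: float      # work already done (age variable)
    sample: float   # sampled firing time (work requirement)


def on_resume(policy: str, mem: Memory, resample: Callable[[], float]) -> Memory:
    """Memory of the transition when it is enabled again after a preemption."""
    if policy == "prd":   # repeat different: work lost, new requirement sampled
        return Memory(age=0.0, sample=resample())
    if policy == "prs":   # resume: both the work done and the requirement are kept
        return Memory(age=mem.age, sample=mem.sample)
    if policy == "pri":   # repeat identical: work lost, same requirement
        return Memory(age=0.0, sample=mem.sample)
    raise ValueError(f"unknown policy: {policy}")


def remaining_work(mem: Memory) -> float:
    """Time that still has to elapse before the transition fires."""
    return mem.sample - mem.age


preempted = Memory(age=0.3, sample=1.0)
for policy in ("prd", "prs", "pri"):
    resumed = on_resume(policy, preempted, resample=lambda: 2.0)
    print(policy, remaining_work(resumed))
```

A fixed value is used in place of a random resampling so that the three cases can be compared directly.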

From the previous discussion it is clear that the main difficulty in analyz- 
ing stochastic Petri nets with general transitions is related to the fact that the 
underlying discrete state marking process is no longer a CTMC, as its future 
evolution depends on the past history. Below, we call general (GEN) transitions 
both the transitions with generally distributed firing time (including the deter- 
ministic ones) and the exponentially distributed firing time transitions of pri 
type. For a transition with exponentially distributed firing time the prd and the 
prs policies have the same effect, due to the memoryless property. We denote 
these transitions as EXP transitions. For a transition with deterministic firing 
time, the prd and the pri policies have the same effect, since a resampling of the 
firing time results in the same firing time sample each time. 

According to this memory concept, at any time the marking and the individ- 
ual memory associated with the GEN transitions of a NMSPN only determine 
the future stochastic behavior of the NMSPN. This means that the marking 
process together with the memory process of the GEN transitions is a Markov 
process. 

Below we make a distinction between enabled and active transitions. In fact, 
a GEN transition may be active (the age variable $a_g$ is greater than 0 for a prs 
transition, or the threshold value $\gamma_g$ is already set for a pri transition) but 
not enabled. 

The main idea behind our proposed discrete time approach is to discretize the 
continuous memory process and the time to obtain a discrete time Markov chain 
(DTMC) that approximates the stochastic behavior of the compound Markov 
process. The time axis is divided into equal intervals of size $\delta$, while we use 
the concept of discrete phase type distributions (DPH) to discretize the memory 
process when it is possible. 




Fig. 1. SPN with one GEN transition 



3 Discrete-Time Approach 

In order to explain how to approximate the stochastic 
behavior of continuous time SPNs, the SPN shown in Figure 1 is considered as 
an example. This SPN models a system that alternates between two conditions: 
a fully operative state (token in place $P_2$) and a failure state (token in place $P_1$). 
The EXP transitions $t_1$ and $t_2$ describe the changes in the system state from 
operational to failed, and vice versa. Transition $t_3$ models the duration of the 
work to be performed, and it is assumed to be non-exponentially distributed. 
In this example the DPH [3] distribution, depicted in Figure 2a, with generator 
$P = \{P_{ij}\}$ and initial probability $\alpha = \{1, 0, 0, \ldots\}$ is used to approximate the 
firing time of $t_3$. According to this DPH structure, the firing of transition $t_3$ can 
happen when the DPH is either in phase 2 or in phase 4. We want to stress that 
the DPH of Figure 2 is used only as an example to show how our approach works, 
but, in general, there is no restriction on the DPHs that can be used to approximate the 
firing time of a GEN transition. Similarly, Figure 2b depicts the DPH we have 
adopted to approximate the firing of an exponentially distributed transition, 
where $\lambda$ is the firing rate and $\delta$ is the approximation step. 
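
The firing time distribution represented by a DPH can be evaluated step by step. The sketch below assumes that the DPH of an exponential transition consists of a single phase that is left with probability λδ in each time slot (a geometric distribution); this is an assumption about Figure 2b, and the function names are hypothetical.

```python
def dph_pmf(alpha, P, n_max):
    """P(T = n) for n = 1..n_max, with T the number of steps until absorption.

    The probability of leaving phase i towards the absorbing phase is
    1 - sum(P[i]).
    """
    exit_p = [1.0 - sum(row) for row in P]
    state = list(alpha)
    pmf = []
    for _ in range(n_max):
        pmf.append(sum(s * e for s, e in zip(state, exit_p)))
        state = [sum(state[i] * P[i][j] for i in range(len(state)))
                 for j in range(len(state))]
    return pmf


def exponential_dph(lam, delta):
    """Single-phase DPH: fire with probability lam * delta per slot."""
    assert 0.0 < lam * delta < 1.0
    return [1.0], [[1.0 - lam * delta]]


alpha, P = exponential_dph(lam=2.0, delta=0.01)
pmf = dph_pmf(alpha, P, 2000)
mean_time = 0.01 * sum((n + 1) * p for n, p in enumerate(pmf))
print(pmf[0], mean_time)
```

The mean firing time of this approximation is δ/(λδ) = 1/λ, so the discretization preserves the mean of the exponential distribution.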

In this paper we assume that the DPH distribution starts from the first 
phase with probability 1. This assumption is not restrictive since any acyclic 
DPH distribution can be represented with an acyclic DPH distribution of the 
same order starting from the first phase with probability 1 ([3]), and a general 
DPH distribution of order n with generator $P$ and initial probability vector $\alpha$ 
can be represented as $P'$ and $\alpha'$ of order n + 1, where 

$$
P' = \begin{pmatrix} 0 & \alpha P \\ 0 & P \end{pmatrix}, \qquad \alpha' = \{1, 0, \ldots, 0\}.
$$
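
That the order n + 1 representation yields the same distribution can be checked numerically. In the following sketch the new first phase combines the choice of the initial phase with the first step of the original DPH; the two-phase example and all names are hypothetical.

```python
def dph_pmf(alpha, P, n_max):
    """P(T = n) for n = 1..n_max; exit probability of phase i is 1 - sum(P[i])."""
    exit_p = [1.0 - sum(row) for row in P]
    state = list(alpha)
    pmf = []
    for _ in range(n_max):
        pmf.append(sum(s * e for s, e in zip(state, exit_p)))
        state = [sum(state[i] * P[i][j] for i in range(len(state)))
                 for j in range(len(state))]
    return pmf


def start_in_first_phase(alpha, P):
    """Build P' = [[0, alpha P], [0, P]] and alpha' = (1, 0, ..., 0)."""
    n = len(alpha)
    alpha_P = [sum(alpha[i] * P[i][j] for i in range(n)) for j in range(n)]
    P_new = [[0.0] + alpha_P] + [[0.0] + list(row) for row in P]
    alpha_new = [1.0] + [0.0] * n
    return alpha_new, P_new


alpha = [0.4, 0.6]
P = [[0.5, 0.3],
     [0.0, 0.7]]
alpha_new, P_new = start_in_first_phase(alpha, P)
original = dph_pmf(alpha, P, 20)
expanded = dph_pmf(alpha_new, P_new, 20)
print(max(abs(a - b) for a, b in zip(original, expanded)))
```

The two probability mass functions agree up to rounding, at the price of one additional phase.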




Fig. 2. The DPH approximation of the firing time of ts 



3.1 SPN with One Generally Distributed PRD Transition 

Let us suppose that the GEN transition is associated with a prd memory policy. 
Using DPH distributions, the state of the expanded DTMC is defined as a pair of 
non-negative integers (i,u), where i is the index of a marking ($M_i \in RG(M_0)$), 
and u is a phase of the DPH associated with the GEN transition. Thus, u is 
used to capture the “memory” that is necessary to model the GEN transitions. 
u = o denotes that the process is in a state where the general transition is not 
influential (i.e. it has no memory). If $1 \le u \le \nu$, the GEN transition is enabled. 
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The pair (i,u) will be called descriptor and identifies the state of the expanded 
DTMC. 

Figure 3 gives the DTMC constructed to approximate the stochastic behavior 
of the Petri net depicted in Figure 1. The chain is derived from the reachability 
graph. All the states in the reachability graph are examined, and the DTMC is 
generated depending on the transitions enabled in each of them. Each marking in 
the original continuous process produces a set of states of the expanded DTMC 
characterized by the same index i in the descriptor (i,u). All the states with 
the same marking index in its descriptor constitute a macrostate. Of course, the 
expanded process has as many macrostates as the number of markings of the 
continuous process. In Figure 3 the three macrostates are outlined by ellipses 
with the name of the marking depicted nearby. 




Fig. 3. DTMC approximation of the SPN on Figure 1 with prd transition 



In marking $M_0$ only one EXP transition is enabled, so that memory does 
not need to be maintained, and the macrostate has only one state, the state 
(0,o). From this macrostate only two arcs can exit, relating to the firing or not 
of the EXP transition. Similar considerations apply to the 
marking $M_2$. In this marking, no transitions are enabled, thus no memory is 
needed. 

Conversely, in marking $M_1$ an EXP and a GEN transition are enabled; the 
marking is therefore expanded to describe the evolution of the GEN transition using 
the DPH. The macrostate corresponding to this marking has the same number of 
states as the number of phases of the DPH: the states with descriptor (1,u), with 
$1 \le u \le 4$. The absorbing state of the DPH is not used to expand the marking into 
the macrostate, because it represents the firing of the GEN transition (and therefore a 
change into a new marking). The states in the macrostate describe the evolution 
of the transition $t_3$. 

Since $t_3$ is a prd transition, when it becomes enabled its age memory starts 
from zero. This means that the DTMC enters the first phase of the DPH, in 
the example the state with descriptor (1,1). Since a step of the DTMC corresponds 
to a time slot of length $\delta$, the one step probabilities of the two outgoing 
arcs model the firing or not of the enabled EXP transition in an interval 
of length $\delta$. Using the first order approximation for the exponential function 
it is easy to realize that $p_{(0,o)\to(1,1)} = P\{t_1 \text{ fires} \mid (0,o)\} = \delta\lambda_1$ and 
$p_{(0,o)\to(0,o)} = P\{t_1 \text{ does not fire} \mid (0,o)\} = 1 - \delta\lambda_1$, where $\lambda_1$ is the rate of the 
EXP transition $t_1$. 

As we have already said, the macrostate with states (1,u), with $1 \le u \le 4$, 
corresponds to the marking $M_1$. The marking process remains in this marking 
until one of the two transitions $t_2$ or $t_3$ fires. If neither of them fires, the 
marking does not change. This means that the DTMC stays in the macrostate, 
and only passages between two phases of the DPH are possible. The one step 
probability must take into account that the EXP transition does not fire, thus 
the probability of moving from a state with descriptor (1,u) to one with descriptor 
(1,v) is $p_{(1,u)\to(1,v)} = P_{uv}(1 - \delta\lambda_2)$. 

The outgoing arcs from the macrostate are due to some firing: when $t_2$ fires, 
the DTMC goes to the state with descriptor (0,o), because this firing causes the 
marking process to go to the marking $M_0$; whereas the firing of $t_3$ is described 
by the arcs towards the absorbing phase of the DPH, so that two arcs from the 
states with descriptors (1,2) and (1,4) towards the state with descriptor (2,o) 
are used. The one step probabilities are easily computed with reference to the firing 
events and to the DPH structure. The only thing to note is that in a time 
slot $\delta$ both transitions $t_2$ and $t_3$ can fire, so the simultaneous firing event 
has to be considered. In the case when both transitions fire in the same time 
slot, we uniformly distribute the probability of firing between the two possible 
destination states with descriptors (0,o) and (2,o). This is where the factors 
$P_{25}\,\delta\lambda_2/2$ and $P_{45}\,\delta\lambda_2/2$ come from. 
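
The construction for the prd case can be sketched as a small program that assembles the one step probability matrix of the expanded DTMC. The DPH used here is a hypothetical four-phase example with exits from phases 2 and 4, and the way the firing probability of $t_2$ is combined with the DPH step (in particular the equal split for simultaneous firings) is an assumption based on the description above, since the exact arc labels appear only in the figure.

```python
def build_prd_dtmc(P, lam1, lam2, delta):
    """One step matrix over the states (0,o), (1,1)..(1,n), (2,o)."""
    n = len(P)
    size = n + 2
    first, last = 0, n + 1            # indices of (0,o) and (2,o)
    a1, a2 = lam1 * delta, lam2 * delta
    T = [[0.0] * size for _ in range(size)]

    T[first][first] = 1.0 - a1        # t1 does not fire
    T[first][1] = a1                  # t1 fires: enter phase 1 of the DPH
    for u in range(n):
        exit_u = 1.0 - sum(P[u])      # probability that t3 fires from phase u+1
        for v in range(n):
            T[1 + u][1 + v] = P[u][v] * (1.0 - a2)    # neither t2 nor t3 fires
        # t2 fires alone, or both fire and half of that mass goes to (0,o)
        T[1 + u][first] = (1.0 - exit_u) * a2 + exit_u * a2 / 2.0
        # t3 fires alone, or both fire and the other half goes to (2,o)
        T[1 + u][last] = exit_u * (1.0 - a2) + exit_u * a2 / 2.0
    T[last][last] = 1.0               # no transition is enabled in M2
    return T


# Hypothetical DPH with four phases; phases 2 and 4 can reach absorption.
P = [[0.25, 0.5, 0.25, 0.0],
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0, 0.25]]
T = build_prd_dtmc(P, lam1=1.0, lam2=2.0, delta=0.01)
print([round(sum(row), 12) for row in T])
```

Every row of the resulting matrix sums to one, so the assumed split of the simultaneous firing probability keeps the chain stochastic.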

3.2 SPN with One Generally Distributed PRS Transition 

If the GEN transition is associated with a prs policy, the DTMC structure has 
to be organized in order to keep track of the amount of time the prs transition 
spent in an enabled condition before being preempted. This is because the transition 
has to restart with the same age memory value once it becomes enabled 
again. For this purpose, a different expanded DTMC is needed. Figure 4 shows 
the DTMC that approximates the stochastic behavior of the SPN depicted in 
Figure 1 when $t_3$ has a prs memory policy. 

The only difference with regard to the prd case is the macrostate related 
to the marking $M_0$. With a prs policy, four states with descriptors (0,u), with 
$1 \le u \le 4$, are added to the macrostate. The purpose of these descriptors is 
to remember the value of the age memory of transition $t_3$ when it is disabled by 
the firing of the EXP transition $t_2$. 

Thus, from each state with descriptor (1,u), with $1 \le u \le 4$, the DTMC 
can transit either to the state with descriptor (0,u) (with one step probability 
$p_{(1,u)\to(0,u)} = P_{uu}\delta\lambda_2$) or to the state with descriptor (0,u+1), where $1 \le u \le 3$ 
(with probability $p_{(1,u)\to(0,u+1)} = P_{u(u+1)}\delta\lambda_2$). 

Of course, transition $t_3$ cannot fire from any of the states corresponding 
to marking $M_0$, as it is not enabled in such marking. From each of these 
states it is possible either to exit with probability $p_{(0,u)\to(1,u)} = \delta\lambda_1$, with 
u = 1, 2, 3, 4, when transition $t_1$ fires, or to remain in the same state with 
probability $p_{(0,u)\to(0,u)} = 1 - \delta\lambda_1$, with u = 1, 2, 3, 4, when $t_1$ does not fire in a time 
slot $\delta$. 

Fig. 4. DTMC approximation of the SPN with prs transition 

The same considerations made with regard to the firing of $t_3$ with a prd policy 
are valid in this case. 



3.3 SPN with One Generally Distributed PRI Transition 

If a pri policy is assumed for the GEN transition an interrupted job must be 
repeated with an identical work requirement. To capture the stochastic behavior 
of this case, a different expanded DTMC is constructed. 

To model a transition $t_k$ with an associated pri preemption policy, the following 
quantities are computed: $Q_i^k = F^k(i\delta) - F^k((i-1)\delta)$. $Q_i^k$ approximates 
the firing probability of transition $t_k$ in the i-th $\delta$ interval. To make the model 
solvable in practice, the firing time distribution of a pri transition is supposed 
to have finite support, in order to avoid the computation of an infinite number 
of nonzero $Q_i^k$ values and the construction of an approximate discrete process with 
infinite state space. Let us denote the number of nonzero quantities with . 
In case of infinite support, a truncation of F^{t) may be used. 
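
The discretization of a firing time distribution with finite support can be sketched as follows. The uniform distribution on [1, 3] and the step size 0.5 are hypothetical values chosen for illustration, as are the function names.

```python
def discretize(cdf, delta, n_slots):
    """Q_i = F(i * delta) - F((i - 1) * delta) for i = 1..n_slots."""
    return [cdf(i * delta) - cdf((i - 1) * delta) for i in range(1, n_slots + 1)]


def uniform_cdf(a, b):
    """Distribution function of the uniform distribution on [a, b]."""
    def F(x):
        return min(1.0, max(0.0, (x - a) / (b - a)))
    return F


Q = discretize(uniform_cdf(1.0, 3.0), delta=0.5, n_slots=6)
print(Q, sum(Q))
```

Because the support ends at 3, six slots of length 0.5 carry the whole probability mass; the first two slots lie below the support and receive probability zero, so only four of the quantities are nonzero.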

The stochastic behavior of an enabled pri type transition is described by two 
continuous variables: the actual sample of the firing time and the amount of 
time during which the transition has been enabled. In the proposed expansion 
method, the descriptor (i,u,w) with $u \le w$ is used in order to describe the state 
of the process, where i indicates the marking, u indicates the duration of time 
for which the transition has been enabled (measured in integer numbers of time slots $\delta$), 
and w is the sampled value (measured in integer numbers of time slots $\delta$). The 
descriptor (i,0,w) indicates that the pri transition is disabled but has not 
fired, so that the sampled firing time w is maintained; after becoming enabled 
again, the process enters state (i,1,w). The descriptor (i,o,o) is used for states 
where the process has no memory. 
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The evolution of the GEN transition tk with pri memory policy in isolation 
can be described by columns. The w-th column consists of w states with 
descriptors (i,u,w), where $1 \le u \le w$. Recalling that w is the sampled firing 
time, when the discrete process enters a state with descriptor (1,1,w), w slots 
of time have to pass before the firing. This is exactly the time spent to transit 
among the states of the column. 

Figure 5 shows the DTMC that approximates the behavior of the SPN shown 
in Figure 1. In this case, the macrostate corresponding to the marking $M_1$ consists 
of the states approximating the GEN transition $t_3$ as described before. 
From the state with descriptor (0,o,o), the DTMC enters the macrostate corresponding 
to marking $M_1$, and specifically the column selected according to the 
probability $Q_w$. Since this happens if the EXP transition $t_1$ fires in a time slot, 
the one step probability is $p_{(0,o,o)\to(1,1,w)} = Q_w\delta\lambda_1$. 

The macrostate referring to the marking $M_0$ has states with descriptor
$(0, 0, w)$, reached by the DTMC when the GEN transition is disabled by the
firing of the conflicting transition. These states are used to remember the
correct sampled firing value, so that when the GEN transition is enabled again
the correct column is reached. The one-step probabilities between two states in
this macrostate are computed according to the firing events related to the EXP
transition $t_2$, also enabled in marking $M_1$, as in the other cases.

The GEN transition fires when $u = w$ in the descriptor. When this happens,
the DTMC moves to the state with descriptor $(2, \circ, \circ)$.

4 General Solution 

In the last three subsections we have described a method to build a DTMC
to approximate the stochastic behavior of Petri nets containing only one GEN
transition. Using the same approach, in this section we show how to derive the
underlying DTMC for SPNs with more than one GEN transition simultaneously
enabled. The following notation has to be introduced:

— and is the number of prd, prs and pri transitions in the SPN, 
respectively; 

— $A^D(i)$, $A^S(i)$ and $A^I(i)$ are the sets of enabled prd, prs and pri GEN
transitions in marking $M_i$, respectively;

— $P^k_{ij}$ is the probability of moving from phase $i$ to phase $j$ in the DPH
structure of the transition $t_k$; it describes how a prd or prs GEN transition
changes its phase;

— $Q^k_i$ is the approximated probability that the pri GEN transition $t_k$ fires
in the $i$-th $\delta$ interval;

— $L_k$ is the number of phases in the DPH structure of the prd or prs transition
$t_k$.



As already discussed, we need one variable to handle transitions with prd and
prs policy (to store the current phase of the expanded DTMC), and two variables
to handle transitions with pri policy (one to store the age of the transition, and
the other to store the sampled value of the firing time).

Fig. 5. DTMC approximation of the SPN with pri transition

When a pri transition becomes enabled, the associated random variable is sampled
and the age variable is set to 1. If the pri transition is preempted in the next
state, the age variable is reset to 0 and the associated sampled value remains
the same.

Thus a generic state of the DTMC will be $Z^r = (j, D^r, S^r, I^r, X^r)$, where

— $j$ is the index of marking $M_j$ of the SPN;

— $D^r$ is a vector of length $\|T\|$, the number of transitions in the SPN,
storing the phases in which a prd transition is allowed to be; in particular,
its $k$-th element ($D^r_k$) is the phase of transition $t_k$ when the DTMC is in
the state $Z^r$; the sign $\circ$ in the $k$-th position indicates that the prd GEN
transition $t_k$ has no memory (it is not enabled);

— $S^r$ is the same as $D^r$ but for prs GEN transitions; $S^r_k = \circ$ means that
the prs transition $t_k$ is not active, thus it has no memory;

— $I^r$ is a vector of length $\|T\|$. The $k$-th element of $I^r$ ($I^r_k$) is the age
of the pri GEN transition $t_k$ when the DTMC is in the state $Z^r$; similarly to
the case of prs transitions, $I^r_k = \circ$ indicates that transition $t_k$ is not
active;

— $X^r$ is a vector whose $k$-th element is the sampled value of the pri GEN
transition $t_k$ when the DTMC is in the state $Z^r$.

Note that as time increases by $\delta$, at step $i$ the total elapsed time is
$i\delta$. This explains why only the index indicating the time interval has to be
recorded.

Given a state $Z^r = (i, D^r, S^r, I^r, X^r)$ of the DTMC, we consider first the
case when none of the enabled transitions fires in a time slot $\delta$, and then we
show the more complex case when some firings occur.
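One possible in-memory form of such a descriptor is sketched below in Python. The class name `State`, the use of `None` for the sign $\circ$, and the 0-based transition indices are assumptions made for illustration; the paper does not prescribe a data structure.

```python
from typing import NamedTuple, Optional, Tuple

NO_MEMORY = None  # stands for the "no memory" sign used in the descriptors

Vector = Tuple[Optional[int], ...]


class State(NamedTuple):
    marking: int   # index of the marking
    D: Vector      # phases of prd transitions
    S: Vector      # phases of prs transitions
    I: Vector      # ages of pri transitions (in slots)
    X: Vector      # sampled firing times of pri transitions (in slots)


# Three transitions: t_0 is prd in phase 2, t_1 has no memory,
# t_2 is pri with age 1 and a sampled firing time of 4 slots.
z = State(marking=1,
          D=(2, NO_MEMORY, NO_MEMORY),
          S=(NO_MEMORY, NO_MEMORY, NO_MEMORY),
          I=(NO_MEMORY, NO_MEMORY, 1),
          X=(NO_MEMORY, NO_MEMORY, 4))

# Tuples are hashable, so states can be stored in sets and used as dict keys.
visited = {z}
```

An immutable, hashable descriptor makes the "not previously created" test of a state-space exploration a simple set lookup.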



4.1 Initial States and Probability Vector 

When the algorithm starts to generate the approximated discrete process, a set of
initial states is created together with an initial probability vector. The number
of initial states depends on the transitions enabled in the initial marking $M_0$.
The prd and prs transitions are considered without memory when the process
starts; thus, assuming that the DPH distribution starts from the first phase,
the memory variables associated with these transitions are described by the
following equations:

$$D^0_k = \begin{cases} 1 & t_k \in A^D(0) \\ \circ & \text{otherwise} \end{cases}, \qquad S^0_k = \begin{cases} 1 & t_k \in A^S(0) \\ \circ & \text{otherwise} \end{cases} \tag{1}$$



If, instead, pri transitions are enabled in the initial marking $M_0$, a set of
states has to be created to remember the different levels of sampled values
$Q^k_i$, with $t_k \in A^I(0)$. To explain how to build the initial states, consider
for each pri transition $t_k$ the maximum value of $X_k$, which is the number of
different possible values, $Q^k_i$, for the firing probability of $t_k$ (if the pri
transition $t_k$ is not enabled in the marking $M_0$, this number is 0). The number
$s$ of initially built states is the product of these numbers over all
$t_k \in A^I(0)$, and each state corresponds to a different combination of the
possible values assumed by $X_k$.

To formally construct the descriptor, we define a function that associates each
possible state with an index, starting from the values assumed by the components
of $X$. Let $k_l$ be the index of the $l$-th pri transition enabled in $M_0$
($t_{k_l} \in A^I(0)$).
'With this formalism the index is r = ^ — 1) I 

Vice versa, given a value of the index $r$, the combination that generated it can
be found. We denote this function by $v(r, l)$.
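A common way to realise such an index and its inverse is a mixed-radix encoding, sketched below in Python. This is an assumption about the form of the mapping, not a transcription of the paper's formula; the names `combo_to_index`, `v` and `sizes` (the number of possible sampled values of each enabled pri transition) are hypothetical, and positions are 0-based here.

```python
def combo_to_index(values, sizes):
    """Map sampled values (values[l] in 1..sizes[l]) to an index r in 0..prod(sizes)-1."""
    r, weight = 0, 1
    for x, n in zip(values, sizes):
        r += (x - 1) * weight   # digit (x - 1) in a mixed-radix number system
        weight *= n
    return r


def v(r, l, sizes):
    """Inverse map: sampled value of the l-th enabled pri transition for index r."""
    for n in sizes[:l]:
        r //= n                 # strip the lower-order digits
    return r % sizes[l] + 1


sizes = [3, 2]                  # two enabled pri transitions with 3 and 2 possible values
r = combo_to_index((3, 2), sizes)
```

Every combination receives a distinct index, and `v` recovers each component from the index alone.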

With these definitions, it is possible to describe all the components of the
state descriptors generated at the beginning. The components of the states
$Z^r = (0, D^r, S^r, I^r, X^r)$, $\forall r = 0, \ldots, s-1$, are the vectors
$D^r = D^0$ and $S^r = S^0$, whereas $I^r$ and $X^r$ assume the following values:

$$I^r_k = \begin{cases} 1 & t_k \in A^I(0) \\ \circ & \text{otherwise} \end{cases}, \qquad X^r_k = \begin{cases} v(r, l(k)) & t_k \in A^I(0) \\ \circ & \text{otherwise} \end{cases} \tag{2}$$

where $l(k)$ is the position of transition $t_k$ among the enabled pri transitions
in $M_0$.

The generic element of the initial probability vector is:

$$\pi_r(0) = \prod_{k:\, t_k \in A^I(0)} Q^k_{v(r, l(k))}, \qquad \forall r = 0, \ldots, s-1 \tag{3}$$
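The product form of the initial probability vector can be illustrated with a short Python sketch. The function name `initial_vector` and the list-of-lists input are assumptions; the states are enumerated with `itertools.product`, whose ordering is not claimed to match the paper's index $r$.

```python
from itertools import product


def initial_vector(q_by_transition):
    """q_by_transition[l][x-1]: probability that the l-th pri transition
    enabled in the initial marking samples a firing time of x slots."""
    states, probs = [], []
    ranges = [range(1, len(q) + 1) for q in q_by_transition]
    for combo in product(*ranges):
        p = 1.0
        for q, x in zip(q_by_transition, combo):
            p *= q[x - 1]       # independent sampling: probabilities multiply
        states.append(combo)
        probs.append(p)
    return states, probs


states, probs = initial_vector([[0.5, 0.5], [0.2, 0.3, 0.5]])
```

Because each transition samples independently, the entries sum to 1 whenever each individual distribution does.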
4.2 No Firing

In this section we describe how to generate a new state of the expanded discrete
process starting from a given state of the expanded process itself, when none of
the enabled transitions fires.

Let $Z^r = (i, D^r, S^r, I^r, X^r)$ be the descriptor of the state currently
considered, and $Z^{r'} = (i, D^{r'}, S^{r'}, I^{r'}, X^{r'})$ the descriptor of the
state we want to generate. Note that the first component of the descriptor
$Z^{r'}$ is the same as that of $Z^r$ because no firing is assumed and the marking
does not change. The components of the descriptor $Z^{r'}$ for prd and prs
transitions are built as follows:



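$$D^{r'}_k = \begin{cases} a,\ 1 \le a \le L_k & t_k \in A^D(i) \\ \circ & t_k \notin A^D(i) \end{cases}, \qquad S^{r'}_k = \begin{cases} b,\ 1 \le b \le L_k & t_k \in A^S(i) \\ S^r_k & t_k \notin A^S(i) \end{cases} \tag{4}$$

where $a = \mathrm{next}(t_k, D^r_k)$ and $b = \mathrm{next}(t_k, S^r_k)$,
$\mathrm{next}(t, p)$ being a function that computes the index of a phase of the
DPH associated with transition $t$ reachable from the phase with index $p$.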

The vectors for pri transitions are computed as follows:

$$I^{r'}_k = \begin{cases} I^r_k + 1 & t_k \in A^I(i) \\ I^r_k & t_k \notin A^I(i) \end{cases}, \qquad X^{r'}_k = X^r_k \tag{5}$$

These last two equations describe how to manage the two variables for the pri
transitions depending on whether the transition is enabled or not.

The state transition probability from $Z^r$ to $Z^{r'}$ in a time slot is computed
as the product of the probability that none of the EXP transitions will fire
times the probabilities that the prd and prs GEN transitions change their phase:

$$P_{Z^r \to Z^{r'}} = \prod_{k \in A^D(i)} P^k_{D^r_k D^{r'}_k} \prod_{l \in A^S(i)} P^l_{S^r_l S^{r'}_l}$$
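The no-firing step can be sketched in Python as follows. The function name, the tuple layout (with `None` for no memory), and the nested-dictionary form `P[k][a][b]` for the phase-change probabilities are assumptions made for illustration. The sketch also assumes that rows of `P[k]` list only non-firing phase changes and that an EXP transition is handled as a single-phase entry, which the source does not state explicitly.

```python
from itertools import product


def no_firing_successors(D, S, I, X, en_prd, en_prs, en_pri, P):
    """Enumerate (D', S', I', X', probability) after one slot without firings."""
    # Ages of enabled pri transitions grow by one slot; sampled values are kept.
    I2 = tuple(a + 1 if k in en_pri else a for k, a in enumerate(I))
    moving = sorted(en_prd | en_prs)
    # Candidate next phases (with probabilities) of every enabled prd/prs transition.
    choices = []
    for k in moving:
        cur = D[k] if k in en_prd else S[k]
        choices.append([(b, p) for b, p in P[k][cur].items() if p > 0])
    out = []
    for combo in product(*choices):
        D2 = [None] * len(D)      # prd transitions that are not enabled lose memory
        S2 = list(S)              # prs transitions that are not enabled keep their phase
        prob = 1.0
        for k, (b, p) in zip(moving, combo):
            (D2 if k in en_prd else S2)[k] = b
            prob *= p
        out.append((tuple(D2), tuple(S2), I2, tuple(X), prob))
    return out


# t_0: enabled prd in phase 1; t_1: disabled prs remembering phase 2;
# t_2: enabled pri with age 1 and sampled firing time 4.
P = {0: {1: {1: 0.5, 2: 0.3}, 2: {2: 0.6}}}
succ = no_firing_successors((1, None, None), (None, 2, None),
                            (None, None, 1), (None, None, 4),
                            en_prd={0}, en_prs=set(), en_pri={2}, P=P)
```

Each successor differs only in the phases chosen for the enabled prd and prs transitions, and its probability is the product of the corresponding phase-change probabilities.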



4.3 Firing of One or More Transitions 

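In this section we deal with the case of one or more transitions firing in a
time slot $\delta$.

Given a state $Z^r = (i, D^r, S^r, I^r, X^r)$, only a subset of the enabled
transitions is allowed to fire in a time slot $\delta$. A transition
$t_k \in A^D(i)$ (an enabled prd transition) can fire if the probability of its
DPH firing from the current phase $D^r_k$ is greater than 0; a transition
$t_k \in A^S(i)$ can fire if the corresponding probability from phase $S^r_k$ is
greater than 0;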

a transition $t_k \in A^I(i)$ is allowed to fire if $I^r_k = X^r_k$. Let
$\mathbb{F}^r$ be the set of all the transitions that are allowed to fire when the
process is in state $Z^r$. The elements of this set, whose cardinality is
$\|\mathbb{F}^r\|$, can be grouped into $2^{\|\mathbb{F}^r\|} - 1$ different
subsets, corresponding to all the possible combinations of the transitions
allowed to fire in marking $M_i$. A generic subset of $\mathbb{F}^r$ will be
indicated as $\mathbb{F}^r_p$, with $p = 1, \ldots, 2^{\|\mathbb{F}^r\|} - 1$.
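The enumeration of the non-empty firing subsets can be sketched in a few lines of Python. The function name `firing_subsets` and the use of transition names as strings are illustrative assumptions.

```python
from itertools import combinations


def firing_subsets(fireable):
    """Return all non-empty subsets of the transitions allowed to fire."""
    items = sorted(fireable)
    return [set(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)]


subsets = firing_subsets({"t1", "t2", "t3"})
```

For a set of $n$ fireable transitions this yields $2^n - 1$ subsets, so the number of cases grows exponentially with the number of transitions that may fire in the same slot.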

Considering the generic state $Z^{r'} = (j, D^{r'}, S^{r'}, I^{r'}, X^{r'})$
reachable from $Z^r$ when the transitions belonging to $\mathbb{F}^r_p$ fire, the
values of the components of its descriptor are computed as follows:

$$D^{r'}_k = \begin{cases} 1 & t_k \in (A^D(j) \cap \mathbb{F}^r_p) \cup (A^D(j) \setminus A^D(i)) \\ a,\ 1 \le a \le L_k & t_k \in (A^D(i) \cap A^D(j)) \setminus \mathbb{F}^r_p \\ \circ & \text{otherwise} \end{cases} \tag{6}$$

where, as in equation (4), $a = \mathrm{next}(t_k, D^r_k)$. The first two terms of
the equation set the components of the vector when $t_k$ becomes enabled or
remains enabled, respectively.

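$$S^{r'}_k = \begin{cases} \circ & t_k \in \mathbb{F}^r_p \setminus A^S(j) \\ 1 & t_k \in A^S(j) \cap \mathbb{F}^r_p \\ b,\ 1 \le b \le L_k & t_k \in (A^S(i) \cap A^S(j)) \setminus \mathbb{F}^r_p \\ S^r_k & \text{otherwise} \end{cases} \tag{7}$$

where $b = \mathrm{next}(t_k, S^r_k)$. The different terms of this equation update
the vector according to whether the transition $t_k$ fires, becomes enabled,
remains enabled, or is not enabled but is active, respectively.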



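$$I^{r'}_k = \begin{cases} 0 & t_k \in \mathbb{F}^r_p \setminus A^I(j) \\ 1 & t_k \in A^I(j) \cap \mathbb{F}^r_p \\ I^r_k + 1 & t_k \in (A^I(i) \cap A^I(j)) \setminus \mathbb{F}^r_p \\ I^r_k & \text{otherwise} \end{cases}$$

$$X^{r'}_k = \begin{cases} \circ & t_k \in \mathbb{F}^r_p \setminus A^I(j) \\ v & (t_k \in A^I(j) \cap \mathbb{F}^r_p) \lor (t_k \in A^I(j) \land X^r_k = \circ) \\ X^r_k & \text{otherwise} \end{cases} \tag{8}$$

These equations update the memory associated with $t_k$ and its sampled firing
value ($v$ denotes the values introduced in section 4.1). We point out that the
second term of the equation for $X^{r'}_k$ refers to the enabling of the pri
transition $t_k$, either when it fires in marking $M_i$ and becomes enabled again,
or when it becomes enabled in marking $M_j$ while having no memory.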

Due to the time discretization approach we have adopted, the cdf associated
with each timed transition has a discontinuity at the end of each time slot
$\delta$. Thus, if several transitions are enabled in marking $M_i$, there is a
nonzero probability that they fire simultaneously. The probability that the
transitions belonging to $\mathbb{F}^r_p$ fire simultaneously can be expressed as
follows:

$$f_p = P\{\text{all transitions in } \mathbb{F}^r_p \text{ fire} \mid Z^r\} = \prod_{k \in (\mathbb{F}^r_p \cap A^D(i))} \left(\text{firing probability of } t_k \text{ from phase } D^r_k\right) \prod_{l \in (\mathbb{F}^r_p \cap A^S(i))} \left(\text{firing probability of } t_l \text{ from phase } S^r_l\right) \tag{9}$$

Equation (9) does not include any reference to pri transitions because, if
pri transitions fire, their contribution to $f_p$ is equal to 1. This probability
causes the switch from the generic marking $M_i$ (where the process is now) to a
marking $M_j$. Since the transitions in $\mathbb{F}^r_p$ may be in conflict, the
marking $M_j$ is reached with a probability $w_{ij}$ (the method to compute the
reached marking $M_j$ and the probability $w_{ij}$ is analyzed in depth in [18]).

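The probability associated with the arc from $Z^r$ to $Z^{r'}$ is evaluated as:

$$P_{Z^r \to Z^{r'}} = p_{ij} \prod_{k \in (A^D(i) \cap A^D(j)) \setminus \mathbb{F}^r_p} P^k_{D^r_k D^{r'}_k} \prod_{l \in (A^S(i) \cap A^S(j)) \setminus \mathbb{F}^r_p} P^l_{S^r_l S^{r'}_l} \prod_{m} Q^m_{X^{r'}_m} \tag{10}$$

where $p_{ij} = f_p \cdot w_{ij}$ and the last product runs over the pri transitions
$t_m$ whose firing time is sampled on entering $Z^{r'}$.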
5 The Algorithm

The algorithm is based on a discretization of the continuous random variables in
order to approximate the continuous process. The phase type distributions, used
in the case of prd and prs GEN transitions, are given by the user, whereas the
probabilities $Q^k_i$ are directly computed from the cdf associated with the pri
transition $t_k$. The main steps of the implemented solution method are the
following:

1. generation of the reachability graph (with tangible and vanishing states) and 
reduction of the reachability graph to tangible states only; 

2. generation and analysis of the expanded DTMC; 

3. evaluation of the final measures at the net level, based on the solution of the 
expanded DTMC. 

According to the results shown in the previous sections, given the reacha- 
bility graph and the discrete phase type distributions associated to the GEN 
transitions, the elementary step 2 of the approximation method is as follows: 

— Initialization Step

Initialization consists of creating the set of states originating from the
initial marking $M_0$. Equations (1) and (2) are used to compute these states;
equation (3) is used to compute the initial state probability vector on the
generated states. The created states are put in a list of states to expand
(list.expand).

— Iteration Step

1. a state $Z^r$ to be expanded is extracted from the list.expand list;

2. new expanded states $Z^{r'}$ are computed in case of no firing events using
equations (4) and (5);

3. the probability $P_{Z^r \to Z^{r'}}$ is computed;

4. all the states $Z^{r'}$ not previously created are stored in list.expand;

5. the sets $\mathbb{F}^r_p$, with $p = 1, \ldots, 2^{\|\mathbb{F}^r\|} - 1$, are
computed; according to these sets, the other reachable markings $M_j$ are
computed, and the expanded states $Z^{r'}$ are built using equations (6), (7),
and (8);

6. using equation (10), the transition probabilities from $Z^r$ to $Z^{r'}$ are
computed and stored;

7. all the states $Z^{r'}$ not previously created are stored in list.expand; the
state $Z^r$ is stored in another list named expanded;

8. if the list list.expand is not empty, the algorithm proceeds with step 1;
otherwise it terminates.
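The iteration above is a worklist exploration of the expanded state space. A minimal Python sketch follows. The names `expand_dtmc` and `successors` are hypothetical, and `successors` stands in for the no-firing and firing rules of Section 4, which are not reimplemented here.

```python
from collections import deque


def expand_dtmc(initial_states, successors):
    """successors(state) -> iterable of (next_state, one-step probability)."""
    to_expand = deque(initial_states)   # states still to be expanded
    seen = set(initial_states)          # every state created so far
    expanded = []                       # states already expanded, in order
    arcs = {}                           # (state, next_state) -> probability
    while to_expand:
        z = to_expand.popleft()
        for z2, p in successors(z):
            arcs[(z, z2)] = arcs.get((z, z2), 0.0) + p
            if z2 not in seen:          # store only states not previously created
                seen.add(z2)
                to_expand.append(z2)
        expanded.append(z)
    return expanded, arcs


def column_successors(z):
    """Toy rule: one pri transition in isolation with states (u, w)."""
    if z == "fired":
        return [("fired", 1.0)]         # absorbing state after the firing
    u, w = z
    return [((u + 1, w), 1.0)] if u < w else [("fired", 1.0)]


expanded, arcs = expand_dtmc([(1, 3)], column_successors)
```

The loop terminates only if the set of reachable expanded states is finite, which is one reason the pri firing distributions are restricted to finite support.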

Similarly to [9], the system behavior is approximated by a DTMC over an
expanded state space determined by the cross product of the system states (the
markings of the Petri net) and the discretized values of the associated age
variables. This approach is also closely related to the DPH expansion method
proposed by Cumani in [11]. The main difference is that, in this case, the system
behavior is approximated by an expanded DTMC, while in the PH approximation
case an expanded CTMC is obtained. The present approach also shares some
similarities with the supplementary variable approach [14], since the
supplementary (age) variables are constrained to assume values in a discretized set.







6 Numerical Results 

To test the described method, two kinds of experiments were carried out: 1) in
the first experiment, a preemptive M/G/1/2/2 queue model, presented in [5,4],
whose customers belong to different user classes, was solved. This model belongs
to the MRSPN class; 2) the second experiment involves the same queue model again,
but all the transitions except one have a non-exponentially distributed firing
time. This model cannot be solved with any of the available analytical
techniques. Thus the obtained results were validated by simulation.

The purpose of these experiments is to show that the results obtained by
applying the method described in the previous section are comparable with
those produced by other analytical solution methods, when available. Moreover,
more general classes of models, not analytically solvable by other techniques,
can be studied. The tool WebSPN [15] was used to solve the models under
examination.

The model was solved with the following numerical values: firing rate of the
EXP transitions $\lambda = 0.5$, service time of the deterministic transition
$\tau = 1.0$, and $\delta = 0.05$.
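For orientation, the discrete quantities implied by these values can be worked out directly. The sketch assumes that a deterministic delay $\tau$ is represented by $\tau/\delta$ slots and that an EXP transition with rate $\lambda$ fires within a slot with probability $\lambda\delta$, as in the one-step probability given for the DTMC of Figure 5.

```latex
w = \frac{\tau}{\delta} = \frac{1.0}{0.05} = 20 \ \text{slots},
\qquad
\lambda\,\delta = 0.5 \times 0.05 = 0.025
```

Under these assumptions the column of the deterministic transition has 20 states, and halving $\delta$ would double that number.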

Since the two jobs modeled have different priorities, one server can preempt
the other, and all kinds of preemption policies can be adopted. The results
obtained by solving the model with both the inverse Laplace transform and the
discrete expansion technique are depicted in Figure 6 a), b) and c).
The symbols x, *, o, and □ are used to plot the results obtained with the Laplace
transform method, whereas the continuous lines refer to the results obtained
with the discrete expansion approach. From these graphs it is evident that the
results almost coincide with those computed with the inverse Laplace method,
which is extensively discussed in the literature.




Fig. 6. State probabilities of M/G/1/2/2 model when a prd (a), prs (b) and pri (c) 
memory policy is adopted. 



The second experiment was done by assigning a deterministic submitting time
to the job with higher priority. In this case, the model cannot be solved using
either the Markov regenerative theory or the supplementary variable method.
The results are thus compared with those obtained from a simulator. We solved
the model with the following numerical values: firing rate of the EXP transition
$\lambda_1 = 1.0$, firing time of the deterministic transition $\tau = 0.5$,
service time uniformly distributed between 0.5 and 1.0, and $\delta = 0.05$.

The tool WebSPN is accessible through the Internet at the address
http://sun195.iit.unict.it/webspn

Fig. 7. Comparison with simulation when a prd (a), prs (b) and pri (c) memory policy
is adopted.

Figure 7 shows the probability that no jobs are in the queue. The results
of the simulation are depicted as two dashed lines, identifying the 95%
confidence interval of the computed measure. As can be noted, the results of the
discretization are always inside the confidence interval computed by simulation,
showing that the discrete expansion agrees with the simulation.

More details on these experiments can be found in [18]. 

7 Conclusions 

A numerical approach for the solution of NMSPNs has been proposed. It is
based on a discrete time approximation of the stochastic behavior of the marking
process, which makes it possible to analyze a wider class of SPN models
with concurrently enabled generally distributed transitions with prd, prs and
pri policies. For the prd and prs policies, distributions with infinite support
are considered; for the pri policy, the firing distribution is limited to finite
support. We found that a pri type transition requires two memory variables
in the time domain, which explains why the representation of pri transitions is
quite expensive.

We discussed the way the time-discretization algorithm works both in the 
case of only one general transition in the model and also when an arbitrary 
number of GEN transitions are simultaneously active. 

The described algorithm has been implemented and embedded in the WebSPN
tool for the specification and automatic solution of non-Markovian SPNs.
Abstract. This paper presents TimeNET, a software tool for modelling
and performability evaluation using stochastic Petri nets. The
tool has been designed especially for models with non-exponentially distributed
firing delays. A general overview of the software package and its
new features is given. The graphical user interface has been completely
rewritten. It integrates different model classes in a user-friendly and consistent
way. One of the recent enhancements is an environment for the modelling
and performance evaluation of manufacturing systems based on coloured
stochastic Petri nets. A manufacturing system is modelled and analysed
as an application example.



1 Introduction 

Model based performance evaluation of technical systems is a powerful and in- 
expensive way of predicting the performance before the actual implementation. 
Its importance increases with the complexity of the applications. Discrete event 
systems like computers, communication systems, and manufacturing systems 
are well-known examples. Failures and repairs make their behaviour even more 
unpredictable. The impact of the dependability on the system performance is 
captured in the notion of performability m- 

Petri nets are a graphical method for the convenient specification of discrete 
event systems. They are especially useful for systems with concurrent, synchro- 
nised, and conflicting activities. Evaluation of the performance is facilitated by 
associating stochastic firing delays with transitions. Basic quantitative measures 
like the throughput, loss probabilities, utilisation and others can be computed. 

Since the modelling framework of stochastic Petri nets has been proposed 
(overview of formalism in B), many algorithms and their implementations as 
software tools have been developed (e.g. GreatSPN [3], SPNP [S], UltraSAN [2Dj, 
HiQPN [2|, DSPNexpress m DEI). Most tools provide a graphical user interface 
for the convenient drawing and editing of the model. They usually include ana- 
lysis components for one or several classes of models. There are two main types 
of performance evaluation algorithms: direct numerical analysis or discrete-event 
simulation. Both techniques can be used to compute performance measures ei- 
ther in steady-state or after a transient time. 



B.R. Haverkort et al. (Eds.): TOOLS 2000, LNCS 1786, pp. 188- 120^ 2000. 
TimeNET cni is partially based on an earlier implementation of DSPNexpress
[16] ca, also performed at the Technical University Berlin [15]. TimeNET
is intended for the evaluation of non-Markovian SPNs. Under the restriction 
that all transitions with non-exponentially distributed firing times are mutu- 
ally exclusive, stationary numerical analysis is possible i Ej. If the non- 
exponentially timed transitions are restricted to have deterministic firing times, 
transient numerical analysis is also provided [8] 1121 . In case of concurrently en- 
abled deterministically timed transitions, an approximation component based 
on a generalized phase type distribution is offered as well [2]. 

If the mentioned restrictions are violated or the reachability graph is too 
complex for a model, an efficient simulation component is available m . A mas- 
ter/slave concept with parallel replications and techniques for monitoring the 
statistical accuracy as well as reducing the simulation length in the case of 
rare events are applied m- Analysis, approximation, and simulation can be 
performed for the same model classes. TimeNET therefore provides a unified 
framework for modelling and performance evaluation of non-Markovian stochas- 
tic Petri nets. 

Recent enhancements of TimeNET include a component for the steady- 
state and transient analysis of discrete time deterministic and stochastic Petri 
nets |7] [22]; a component for the modelling, steady-state analysis, and simula- 
tion of stochastic coloured Petri nets, especially for the performance evaluation 
of manufacturing systems 123] Eli [2Sj; and a completely rewritten graphical 
user interface, which integrates all different net classes and analysis algorithms. 
Moreover, using a coloured Petri net model of a manufacturing system, the actual 
system can be controlled directly by TimeNET. 

This paper describes the new version of the tool TimeNET. It has been suc- 
cessfully applied during several modelling and performance evaluation projects. 

This paper is organised as follows. Section 2 contains a description of the
supported net classes and analysis methods of TimeNET. The usage of the tool
and information related to the new interface of TimeNET are given in Section 3.
Modelling and performance evaluation of a manufacturing system is illustrated
in Section 4 using an application example. Finally, concluding remarks are given.

2 Supported Net Classes and Analysis Methods 

The software package TimeNET offers two different modelling and evaluation
environments. Markov regenerative stochastic Petri nets (MRSPNs) with
uncoloured tokens and their corresponding evaluation techniques are described in
subsection 2.1. The part of the tool dealing with coloured Petri nets is presented
in subsection 2.2. Coloured models can contain the same stochastic extensions
as MRSPNs.

2.1 Markov Regenerative Stochastic Petri Nets 

The term “Markov regenerative stochastic Petri nets” has been introduced in [1]
and is now used in literature. In m cni the term “extended deterministic and 
stochastic Petri nets (eDSPNs)” has been used for a similar model class. However,
this model class is closely related to the usual class of Markovian stochastic
Petri nets, popular examples are generalized stochastic Petri nets (GSPNs) [Tj or 
stochastic reward nets (SRNs) [T^- It consists of places and transitions, which 
are connected by input, output, and inhibitor arcs. Places are drawn as cir- 
cles and can contain indistinguishable tokens, while transitions are depicted as 
rectangles. Convenient modeling elements known from SRNs like guards and 
marking-dependent arc cardinalities are also allowed. The main distinctive feature
of the framework concerns the timed transitions: the firing times
may also be deterministic (referred to as “deterministic transitions”, drawn as 
filled rectangles) and generally distributed (referred to as “general transitions” , 
drawn as shaded rectangles). The general distribution may be one of the class 
of so-called “expolynomial” ones, which may be represented piecewise by poly- 
nomials and expolynomials. This class contains common distributions like the 
uniform or the triangular one. The firing semantics of the general transitions can 
either be “race enabling policy” or “age memory policy” (in more recent papers 
also referred to as “preemptive repeat different” and “preemptive resume” ) . 

As usual, the current state of the model is given by the vector of token 
numbers in all places and is referred to as the marking. The reachability graph 
is defined by the set of vertices corresponding to the markings reachable from the 
initial marking and the edges corresponding to transition firings. If an immediate 
transition is enabled in a marking, no time is spent in it during the marking 
evolution. The reachable markings can be partitioned in vanishing and tangible 
markings accordingly. 
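The reachability graph construction described above can be sketched in Python for plain place/transition nets. The function name `reachability_graph` and the tuple encoding of markings are assumptions; inhibitor arcs, guards, marking-dependent cardinalities and the tangible/vanishing distinction are deliberately left out, and the loop terminates only for bounded nets.

```python
from collections import deque


def reachability_graph(m0, transitions):
    """m0: initial marking as a tuple of token counts.
    transitions: name -> (pre, post), tuples of tokens consumed/produced per place."""
    vertices, edges = {m0}, []
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for name, (pre, post) in transitions.items():
            # A transition is enabled if every input place holds enough tokens.
            if all(tokens >= need for tokens, need in zip(m, pre)):
                m2 = tuple(t - a + b for t, a, b in zip(m, pre, post))
                edges.append((m, name, m2))
                if m2 not in vertices:
                    vertices.add(m2)
                    queue.append(m2)
    return vertices, edges


# Two places (idle, busy) holding two jobs that start and end service.
net = {"start": ((1, 0), (0, 1)), "end": ((0, 1), (1, 0))}
vertices, edges = reachability_graph((2, 0), net)
```

Each vertex is a marking and each edge is labelled with the transition whose firing produces the successor marking.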

The behaviour of the model is given by the initial marking and the subse- 
quent transition firings, describing a stochastic process |S]. The type of process 
depends on the types of transitions and certain enabling conditions. The firing 
delay of transitions in TimeNET can either be zero (immediate), exponentially 
distributed, deterministic, or belong to a class of general distributions called 
expolynomial. Such a distribution function can be piecewise defined by exponen- 
tial polynomials and has finite support. It can even contain jumps, making it 
possible to mix discrete and continuous components. Many known distributions 
(uniform, triangular, truncated exponential, finite discrete) belong to this class. 

The analysis in the Markovian case is performed as usual. In the MRSPN
case it is required that the general transitions are mutually exclusive (i.e., two 
of them may not be enabled in the same marking) and have firing semantics 
preemptive repeat different. The stationary analysis is then based on the con- 
struction of an embedded Markov chain [^. If the MRSPN is restricted to a 
DSPN (the general transitions are deterministic), the transient analysis is also 
possible. It is based on the method of supplementary variables Em m- 

TimeNET also provides an approximation component for DSPNs with con- 
currently enabled deterministic transitions based on generalized phase type ex- 
pansion [H]. During the transient analysis, TimeNET shows the evolution of the 
performance measures from the initial marking up to the transient time graphi- 
cally. 
TimeNET also comprises a simulation component [M] [^, which evaluates
eDSPN models without the enabling restriction. The simulation component of 
TimeNET can be applied for the steady-state and transient evaluation. 

Recent additions to TimeNET are components for the analysis of models
with a discrete time scale [7] [22], allowing for geometric distributions,
deterministic times and discrete phase type distributions. This model class is
referred to as discrete deterministic and stochastic Petri nets (DDSPN). Due to
the mapping onto a discrete time Markov chain, there is no enabling restriction.

2.2 Coloured Petri Nets for Manufacturing Systems 

Manufacturing systems are a classical application area of Petri nets; see for a 
recent survey. The class of coloured stochastic Petri nets used by TimeNET has 
been introduced especially for the modelling of manufacturing systems, although 
it is not restricted to it. Two colour types are predefined: Object tokens model 
work pieces inside the manufacturing system, and consist of a name and the 
current state. Elementary tokens cannot be distinguished, and are thus equiv- 
alent to tokens from uncoloured Petri nets. Places can contain only tokens of 
one type. Textual descriptions needed in coloured Petri nets for the definition 
of variables and colour types can be omitted, and the specification of the types 
of places and arcs are implicitly given. TimeNET supplies a library of template 
models for typical machines and their failure-and-repair behaviour. The models 
are hierarchically structured, which is necessary to handle complex systems. 

Structure and work plans are modelled independently using this net class. 
This is important for the evaluation of different production plans, where the 
structural model is not changed. The structural model describes the abilities and 
work plan independent properties of the manufacturing system resources, such 
as machines, buffer capacities, and transport connections. Production sequence 
models specify the work plan dependent features of the manufacturing system. 
Each route can be thought of as a path through the manufacturing system. Later 
on, the different model parts are automatically merged resulting in a complete 
model, which then includes both the resource constraints of the system and the 
synchronisation of the production steps. 

After specifying a manufacturing systems model using the graphical inter- 
face, its performance and dependability can be evaluated. Before doing so, one 
should check whether the model correctly describes the system. Structural ana- 
lysis methods can be used for this task, which are currently being implemented. 
A second debugging tool is the token game, showing the states and actions of 
the system. With TimeNET the modeller is able to either interactively check the 
model behaviour or to watch an automatic animation of it. However, the main 
focus is on efficient quantitative evaluation of performance and dependability 
measures. 

Firing delays are associated with transitions for the performance evaluation. 
We adopt the set of distributions as defined for eDSPNs here. They include 
immediate, exponential, deterministic, and more general transitions. TimeNET 
provides components for the steady-state analysis and simulation. The numerical 
analysis techniques for eDSPNs as described above have been adapted to the
coloured net class. This is possible because the stochastic process belongs to the 
same class. 

Details of the component for manufacturing system modelling and evaluation
can be found in [^. There are two additional components which are
outside the scope of this paper and are therefore only briefly mentioned. Users
without knowledge of Petri nets can model a manufacturing system with
functional blocks originating from the field of production management [26]. Those
models are automatically translated into coloured Petri nets and analysed. The 
provided performance evaluation algorithms can therefore be used without the 
need to use Petri nets. A recent enhancement is the possibility to directly control 
a manufacturing system with the token game or animation feature of TimeNET. 



3 Agnes — The New TimeNET Interface 

A powerful and easy to use graphical interface is an important requirement 
during the process of modelling and evaluating a system. For version 3.0 of 
TimeNET a new generic graphical user interface has been developed. It is called 
“agnes” (a generic net editing system) . 



3.1 Tool Usage 

All currently available and future extensions of net classes and their correspond- 
ing analysis algorithms are integrated with the same “look-and-feel” for the user. 
Figure 1 shows a sample screenshot of the interface during a modelling session
with coloured Petri nets. 

The upper row of the window contains some menus with basic commands 
for file handling, editing, and adjusting display options. Under the menu item 
“Module” the analysis algorithms applicable for coloured Petri nets can be ac- 
cessed. Some of the most frequently used functions are also available by clicking 
one of the icons below. The current mouse position and zoom level are shown
on the right. Below the icon list the current net class is displayed (HCPN for 
hierarchically coloured Petri net in this case). The net objects are displayed on 
buttons in the next row. Each one is shown as an icon and a textual identifier. A 
group of objects is accessed via each one of these buttons. For instance, all tran- 
sitions are organised in one group. Clicking on the transition button switches to 
the next type, but the desired one can also be selected directly by a menu that 
pulls down from the textual object description. 

There are two types of objects available in general, which form the nodes and 
arcs of the model class. One node object and one arc object are always selected and
highlighted. Textual objects like parameter definitions, performance measures, 
and comments are special cases of node objects without the possibility to connect 
arcs with them. 

Fig. 1. Sample screenshot of the graphical user interface

The main drawing area contains a part of the current model. It can be edited
with the left mouse button as in a standard drawing tool, with operations
for selecting, moving, and others. Clicking the middle mouse button creates an
object of the currently highlighted type. Dragging from one object to another 
creates a new arc between them. Double-clicking on any object displays a window 
with its attribute identifiers and their current values, e.g. for a place the initial 
marking and the name. If the object is hierarchically refined (like a substitution 
transition), double-clicking it displays the refining model. 

Depending on the net class, different objects are available in the lower icon 
list. In addition to that, analysis methods typically are applicable for one net 
class only. Those methods are integrated in the menu structure of the tool, 
which therefore changes automatically if a different net class is opened (see
Subsection 3.2).

For eDSPNs there are several additional options. The menu “Validate” con- 
tains algorithms for checking the net for syntax errors, computing structural 
properties like place invariants and extended conflict sets, and the interactive 
token game. The token game can now be run automatically as an animation of 
the system behaviour or with the user selecting the firing transition. 

Performance evaluation algorithms are accessible in the menu “Evaluation”.
Methods for the stationary analysis, approximation, and simulation, as well as
transient analysis and simulation are available. When an evaluation method is
chosen, the user has to decide whether the underlying time scale should be
discrete or continuous. Afterwards, the evaluation can be started or stopped 
(if running). Several algorithm-dependent options can be set. Examples are the 
method of solving the linear system of balance equations and the confidence 
interval of the simulation. 

It is often necessary to evaluate the system performance for different pa- 
rameter values in some given range. The corresponding “experiment” feature of 
TimeNET has been improved in the current version. It is now possible to either 
linearly or logarithmically change the varying parameter. Experiment descrip- 
tions can be stored in a file for later reuse. After the tool has finished all required 
evaluations for the parameter range, the results are automatically plotted using 
gnuplot. 
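The parameter variation described above can be illustrated with a minimal Python sketch. It only shows how linearly or logarithmically spaced values for one varying parameter might be generated and evaluated one by one. The names `sweep_values` and `run_experiment` and the toy measure are hypothetical and are not part of TimeNET.

```python
from typing import Callable, List, Tuple


def sweep_values(start: float, stop: float, steps: int, scale: str = "linear") -> List[float]:
    """Return `steps` values from `start` to `stop` inclusive, spaced linearly or logarithmically."""
    if steps < 2:
        raise ValueError("need at least two steps")
    if scale == "linear":
        delta = (stop - start) / (steps - 1)
        return [start + k * delta for k in range(steps)]
    if scale == "log":
        if start <= 0 or stop <= 0:
            raise ValueError("logarithmic spacing needs positive bounds")
        ratio = (stop / start) ** (1.0 / (steps - 1))
        return [start * ratio ** k for k in range(steps)]
    raise ValueError(f"unknown scale: {scale!r}")


def run_experiment(evaluate: Callable[[float], float],
                   values: List[float]) -> List[Tuple[float, float]]:
    """Evaluate the model once per parameter value and collect (value, measure) pairs."""
    return [(v, evaluate(v)) for v in values]


if __name__ == "__main__":
    # Stand-in for a model evaluation: a made-up saturating curve, not a TimeNET result.
    toy_measure = lambda x: x / (1.0 + x)
    for value, measure in run_experiment(toy_measure, sweep_values(1.0, 100.0, 5, "log")):
        print(f"{value:8.2f} {measure:.4f}")
```

The resulting list of pairs is the kind of data that could then be handed to a plotting program.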

During the transient analysis and simulation, a window displays the tran- 
sient development of the specified performance measures. A graphical editor for 
specifying probability density functions of transitions with generally distributed 
firing delays is available. 

If a manufacturing system is modelled and evaluated with TimeNET 3.0, 
several different net classes with their own applicable algorithms are used. The 
modeller can start with function symbols and translate them into a Petri net 
with separate models of structure and processing sequences. For this task as 
well as the modelling of large manufacturing systems with e.g. similar machines 
the library of Petri net submodels is used. The library in itself is also edited 
with its own net class in TimeNET. Instantiation of a library model and the 
translation of function symbol models into Petri nets as well as the compilation 
of a complete model from the two separate parts are all implemented as net class 
dependent extensions of TimeNET. 

3.2 Net Classes and Modules 

Software development for TimeNET is usually done by students as part of their 
diploma or PhD theses. A major problem is therefore to keep all analysis
components modular with well-defined interfaces. Especially the graphical user
interface has to be capable of integrating different net classes and algorithms and should
be easily extendable. The existing interface of the former version of TimeNET 
was designed for uncoloured Petri nets without hierarchies. Therefore it has been 
completely rewritten for version 3.0. 

A generic interface was implemented that can be used for graph-like models
with different types of nodes and arcs. Nodes can be hierarchically refined by
corresponding submodels. The new interface agnes is not restricted to Petri nets.
In fact, it is already being used for tools other than TimeNET. It is implemented
in C++ and uses the Motif toolkit.

Two design concepts have been included in the interface to make it applicable
to different model classes. A net class corresponds to a model type and is
defined by a text file. In this file, for each node and arc type of the model the
corresponding attributes and the graphical appearance are defined. The shape of
each node and arc is defined using a set of primitives (e.g. polyline, ellipse, and
text). Shapes can depend on the attribute value of an object, making it possible
to show tokens as dots inside places. Figure 1 shows a sample appearance.
Nodes can correspond to submodels of a different net class. This facilitates e.g.
the organisation of the library submodels like in a file manager.

Functions beyond drawing a model depend on the net class and require pro- 
gramming. Agnes offers the possibility to implement modules that are compiled 
and linked to the program. A module has a predefined interface to the main 
program. It can select its applicable net classes and extend the menu structure 
by adding new algorithms. An example of an implemented module applicable 
without restrictions is an export filter to xfig. 

With the described user interface, two modellers are able to work together 
on the same model over the internet. It is possible to start an additional model 
editor window on a remote host. This window shows the same model as the 
original one and allows collaborative work on it. Model parts that are currently 
changed by one modeller are locked for the other. 



4 A Manufacturing System Application Example 

This section demonstrates the use of TimeNET 3.0 for the modelling and
performability evaluation of a simple manufacturing system. A real-life industrial
application example is considered in [24]. The example chosen here is a
manufacturing cell built of parts from the “Fischertechnik” construction kit. It is used
for education and research at the department. Figure 2 shows its layout.

New parts are initially stored in the high bay racking on pallets. The rack 
conveyor can fetch one of them and deliver it to the lower pallet exchange place. 
A horizontal crane then takes it to the first conveyor belt. The system of three 
conveyor belts moves the part to one processing station after another. There are 
two drilling stations, the second having three different interchangeable drilling 
tools. The last station is a milling machine. Parts stay on the conveyor during 
processing. After leaving the machines, parts are transported on a turn table. 
This table puts them into position for the slewing picker arm, which takes the part
to the upper pallet exchange place. From there it is brought back to a place in 
the high bay racking by the rack conveyor. 

The exchange of new and finished parts takes place via the rack storage. We 
assume that there are two different types of work pieces to be processed, named 
A and B. The first one has to be machined by the two drilling machines and 
the milling machine in this order. An additional drilling and milling operation 
is required for parts of the second type. The parts are moving counterclockwise 
through the system. 



4.1 Structural Model 

The described application example is modelled with the class of dedicated
coloured Petri nets as described in Section 2.2. A strict separation between the
model of the manufacturing system’s structure and the sequence of the
processing steps for each product is observed.

Fig. 2. Overview of the modelled system

Figure 3 shows the top level of the structural model. Its composition follows
the layout of the modelled system, which makes it easier to understand. Places
model buffers and other possible locations of parts. The place rack corresponds
to the rack storage, the places exchpl1 and exchpl2 to the pallet exchange
places, and place turnpl to the turn table. The remaining four places represent 
the locations of work pieces on the conveyors which are directly in front of the 
machines or the horizontal crane. As described above, in- and output of parts 
takes place through the rack storage and is modelled with transitions input and 
output. 

In principle, there are two different operations that can be performed: transport
and processing of work pieces. The former corresponds to moving a token
to another place, while the latter is modelled by a change in the colour of the
token that corresponds to the work piece. Transitions modelling machines specify
processing steps which only change the token colour. This is emulated by
removing the former token from the place and instantly adding a token with the
new colour during the firing of the transition. Therefore many transitions and
places are connected by arcs in both directions, which are conveniently drawn
on top of each other. The structural model contains all possible actions of the
resources, even if they are not used for the processing. The horizontal crane
could e.g. move parts to the left as well.

Fig. 3. Highest level of the hierarchical coloured model
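The two kinds of operations, transport and processing, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a marking is a dictionary from place names to multisets of coloured tokens, and colours are written as strings such as `A.0`. The helper names `fire_transport` and `fire_processing` are hypothetical and do not reflect TimeNET's internal data structures.

```python
from collections import Counter
from typing import Dict

Marking = Dict[str, Counter]


def copy_marking(marking: Marking) -> Marking:
    """Return an independent copy so that firing never mutates the old marking."""
    return {place: Counter(tokens) for place, tokens in marking.items()}


def fire_transport(marking: Marking, src: str, dst: str, colour: str) -> Marking:
    """Transport: move one token of `colour` from place `src` to place `dst`."""
    if marking.get(src, Counter())[colour] < 1:
        raise ValueError(f"no token {colour} in place {src}")
    new = copy_marking(marking)
    new[src][colour] -= 1
    new.setdefault(dst, Counter())[colour] += 1
    new[src] = +new[src]  # unary plus drops colours whose count fell to zero
    return new


def fire_processing(marking: Marking, place: str, old_colour: str, new_colour: str) -> Marking:
    """Processing: remove a token of `old_colour` and put back one of `new_colour` in the same place."""
    if marking.get(place, Counter())[old_colour] < 1:
        raise ValueError(f"no token {old_colour} in place {place}")
    new = copy_marking(marking)
    new[place][old_colour] -= 1
    new[place][new_colour] += 1
    new[place] = +new[place]
    return new


if __name__ == "__main__":
    m0 = {"convpl1": Counter({"A.0": 1}), "convpl2": Counter()}
    m1 = fire_transport(m0, "convpl1", "convpl2", "A.0")  # conveyor moves the part
    m2 = fire_processing(m1, "convpl2", "A.0", "A.1")     # machine changes its colour
    print(m2)
```

Because processing takes a token from a place and returns one to the same place, the corresponding arcs run in both directions, which matches the drawing convention mentioned above.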



Transitions with thick bars depict substitution transitions, which are refined 
by a submodel on a lower level of hierarchy. These transitions are e.g. used 
to describe the behaviour of a machine with more detail during a top-down 
design. Submodels from a library of standardised building blocks (templates) 
can be parameterised and instantiated while refining the model. This alleviates 
the creation of complex manufacturing system models, where many structurally 
similar parts can be found. 

Transition rconveyor contains the model of the high bay rack conveyor, while 
transitions hcrane and picker correspond to the horizontal crane and slewing
picker arm. For the transport of a part from one machine to the next, two of 
the three conveyor belts have to operate simultaneously. All three conveyors are 
therefore treated together as one transport means and modelled by transition 
conveyors. Thus, their synchronisation is hidden at a lower level and can be 
specified together. 

Figure 1 shows the graphical user interface of the tool while editing such a
refining submodel. The whole model consists of 11 submodels. The drawing area
contains the model part corresponding to transition drill1 of the high-level
model. There is only one interface with the surroundings, place convpl2. This
place is connected to transition drill1 at the higher level of hierarchy, and is
therefore known at the lower level. As it is only a reference to the real place here, 
it is drawn as a dotted circle. Place convpl2 is the only actual buffer of parts 
in this submodel. All other places correspond to different states of the machine. 
Therefore they are modelled using elementary places, which are drawn thin and 
can contain only uncoloured tokens.

The model has two main parts. The right hand part describes the failure
and repair behaviour of the drilling machine. Exponentially distributed firing 
delays are associated with transitions Fail (mean time to failure) and Repair 
(mean time to repair). On the left is the model of the detailed steps during 
one drilling operation. The immediate transition Start is enabled if there is a 
token in place convpl2 with a colour indicating that the next processing step is a 
drilling operation at this machine. It fires and therefore enables a firing sequence 
of transitions On, Lower, Raise, and Off. They correspond to powering up, 
lowering, raising and finally switching off the drilling tool. Marking-dependent 
firing guards of these transitions ensure that they are not enabled if the drilling 
machine has failed (no token in place ok). The actual processing task is finished
and the colour of the token corresponding to the work piece is changed with the 
firing of transition Raise. 

A submodel such as the one shown can be exchanged, e.g., for a model with a
different failure behaviour without changing the model at the upper level of
hierarchy.



4.2 Production Sequence Model 

In addition to the structural model, a model of the production steps has to be
defined for each product. With TimeNET 3.0, the class of coloured Petri nets is
used for this task as well. Figure 4 shows the first part of the production sequence
model of one work piece.

[Figure 4 node labels: rack, input, rconveyor, exchpl1, hcrane; convpl2, drill1, convpl2, conveyors, convpl1; arc inscription A.0]



Fig. 4. Part of the production sequence model 



This model describes the sequence of operations and transports for a part 
named A at the highest level of hierarchy. Each step can only be carried out by 
a resource that is available in the manufacturing system layout. Therefore, only 
transitions, places and their connecting arcs from the structural model can be 
used here. Arc inscriptions show the name (A) and processing state (0 or 1) of 
the work piece, separated by a dot. 

The model shown in Figure 4 corresponds to the structural model in Figure 3.
Model elements in production sequence models refer to their structural
counterpart through the use of identical names. It is obvious that a substitution
transition in a production sequence model has to be refined with a submodel.
This submodel is then associated with the submodel of the corresponding
substitution transition in the structural model. This relationship between both model
parts holds for all submodels in the hierarchy. We use the term associated Petri
nets for this concept of specifying different views of a system in related model
parts.

It is possible to model alternatives in production sequences of work pieces. 
Logical expressions depending on the current marking as well as probabilities 
can be used to choose a path for a token at such an alternative. A production 
sequence model usually consists of a simple succession of transitions and places. 
An exception is the modelling of (dis-)assembly operations. More than one input 
or output arc is connected to a transition in this case. Although it cannot be
immediately seen in Figure 4, there is also an assembly operation needed for
the example production sequence. Each part is transported and processed while
being fixed to one pallet. For an input of a new part into the rack storage, there
has to be an empty pallet in it (place rack contains a token of colour P.e). The
input operation (transition input fires) removes this token and puts back a token
with colour A.0. The inverse operation is carried out by the output transition.
The pallet itself has no different states and is only implicitly modelled together 
with a work piece. 

4.3 Performability Evaluation of the Example 

After the structure and work plans have been modelled with separate coloured 
Petri nets as described above, a complete model can be automatically generated 
by TimeNET [27]. This is done by adding the information contained in the 
processing sequence models to the structural model. The transitions are enriched 
with descriptions of their different firing possibilities. 

The resulting complete model can then be checked with an interactive sim- 
ulation (token game) or an automated animation. Structural properties like in- 
variants can be computed. The analysis and simulation algorithms of TimeNET 
are then started for an evaluation of the system performance, taking into account 
failures and repairs of the modelled system. 

For the application example, the goal is to maximise the throughput of the
manufacturing system by adjusting the number of available pallets. This number
usually has an important influence on the system performance. The cost of
additional pallets is not taken into account. However, it would be possible to in- 
clude measures of profits and costs in the model, to assess the overall gain. The 
production mix is set to 50% A and 50% B. A performance measure is defined 
in the model which gives the throughput of all finished parts per hour. 

A deterministic firing delay is associated with most of the transitions in the
application example. The reason for this is that simple operations of an auto- 
mated system like in this case do not vary significantly. Therefore the restriction 
of at most one enabled transition with non-exponentially distributed firing de- 
lay in each marking is violated. The simulation component is thus used for the 
performance evaluation. All evaluations have been carried out on a cluster of 
ten UltraSparc workstations with a confidence interval of 99% and a maximum 
relative error of 5%. Each simulation run typically took 20 seconds real time and
125 seconds overall CPU time to complete. 
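How a confidence level and a maximum relative error can serve as a stopping rule is sketched below in Python. This is a generic illustration, not TimeNET's estimator, which the text does not describe. It assumes independent samples and a normal approximation for the confidence interval, and the function name `simulate_until_precise` is hypothetical.

```python
import math
import random
from statistics import NormalDist
from typing import Callable, Tuple


def simulate_until_precise(sample: Callable[[], float],
                           confidence: float = 0.99,
                           max_rel_error: float = 0.05,
                           min_runs: int = 10,
                           max_runs: int = 1_000_000) -> Tuple[float, float, int]:
    """Draw samples until the confidence-interval half-width is at most
    `max_rel_error` times the absolute sample mean.

    Returns (mean, half_width, number_of_samples).
    """
    # Two-sided quantile of the standard normal distribution (assumption:
    # the sample mean is approximately normally distributed).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n, mean, m2 = 0, 0.0, 0.0
    while n < max_runs:
        x = sample()
        n += 1
        # Welford's online update of the mean and the sum of squared deviations.
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_runs and mean != 0.0:
            half_width = z * math.sqrt(m2 / (n - 1) / n)
            if half_width / abs(mean) <= max_rel_error:
                return mean, half_width, n
    raise RuntimeError("requested precision not reached within max_runs samples")


if __name__ == "__main__":
    rng = random.Random(1)
    # Made-up stand-in for one simulation observation with mean 10.
    mean, half_width, n = simulate_until_precise(lambda: rng.expovariate(0.1))
    print(f"mean={mean:.3f} +/- {half_width:.3f} after {n} samples")
```

The higher the variance of the observed measure, the more samples are needed before the rule stops.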




Fig. 5. Throughput of the system versus number of pallets 



TimeNET computes all results automatically with the experiment feature
described in Section 3.1. Figure 5 shows the resulting graph (upper plot). In the
range of zero to three pallets, the throughput increases almost linearly. The 
number of pallets is the most significant bottleneck of the system in these cases. 
An optimal behaviour is achieved with five, six or seven pallets. The performance 
drops if their number is increased to eight, because there are only nine possible 
locations of pallets in the modelled system. Blocking becomes a problem in this 
case. 

To analyse the influence of the deterministic transition firing times a second 
model was evaluated. All deterministic delays were substituted by exponential 
ones with the same mean firing delay. The lower plot in Figure 5 depicts the
results. Just like in the deterministic case, a number of pallets between five and 
seven is optimal. Otherwise the absolute values are quite different, showing the 
importance of deterministic firing delays for a precise performance prediction. 
Much effort is still needed in the development of efficient analysis techniques for 
models with non-exponentially distributed firing delays. 

5 Conclusions 

This paper described the software package TimeNET 3.0, which supports the 
modelling and performability evaluation of discrete event systems. Different 
classes of Petri net models are available, namely Markov regenerative stochas- 
tic Petri nets and their coloured extension. The latter is especially used for the 
separate modelling of manufacturing systems and processing sequences. Efficient
numerical analysis methods are implemented for the calculation of performance 
measures either in steady-state or up to a transient point in time. Basic structural 
properties are detected. An efficient simulation module is available, which takes 
advantage of parallel replications, rare event simulation, and control variates. 
The easily adaptable graphical interface integrates the model specification with 
the access to analysis modules for all net classes. Usage of TimeNET 3.0 for the 
modelling and performance evaluation of a manufacturing system is explained 
in the paper with an application example. 

The tool is available free of charge for non-profit use. It has been successfully 
used in many projects and was distributed to more than 150 universities and 
other organisations worldwide. TimeNET runs under Solaris and Linux. 

In the future we plan several extensions of TimeNET: implementation of transient
analysis and simulation for coloured models, a new structured modelling
method based on Petri nets (Stochastic Petri Net Language), implementation 
of numerical analysis methods for fluid stochastic Petri nets, and methods for 
efficient performance analysis of large systems. 

The authors would like to thank the former colleagues and numerous students 
who have contributed to the development and implementation of the software 
package TimeNET.

Abstract. The compositional representation of a Markov chain using
Kronecker algebra, according to a compositional model representation as 
a superposed generalized stochastic Petri net or a stochastic automata 
network, has been studied for a while. In this paper we describe a gen- 
eralized Kronecker structure, and its implementation, which is able to 
handle synchronization over activities of different levels of priority. The 
achieved Kronecker structure serves for functional analysis as well as 
performance analysis based on Markov chains. 



1 Introduction/Motivation 

Kronecker based approaches [14,19,15,17] for stochastic Petri nets (SPN) are
centered around the idea of compositionality: if a system can be described in
terms of interacting components $N^i$, we can consider the state space of the
complete net as a subset of the cross product of the state spaces of the components.
Matrix operators $\otimes$ and $\oplus$ (Kronecker operators) allow us to compose matrices
built from the reachability graph and/or the infinitesimal generators of the $N^i$
into a Kronecker expression that completely characterizes the reachability graph
RG and the infinitesimal generator Q of the complete net. Reachability and
steady state/transient probabilities can be computed using a Kronecker expression
instead of explicitly storing RG or Q. The method results in a (usually) large
saving in storage and, for matrices of $N^i$ that are not “too sparse”, also in a
saving in execution time [?].

The method also applies to generalized stochastic Petri nets (GSPN) [1],
with the limitation that components interact through timed transitions [?],
or through places that are connected only to timed transitions [10]. GSPN are
SPN where certain transitions (named “immediate”) can fire in zero time; these
transitions have priority over timed ones, and among immediate transitions it
is possible to define different priority levels. The constraint of timed
synchronization is a severe limitation for a number of applications, since many activities
connected to the acquisition of resources, to communications of the rendez-vous
style, or to choices, are more adequately modelled by immediate (zero timed)
transitions. Moreover, the assignment of different priority levels to transitions is
common practice to avoid the problem of probability specification for
immediate transitions known as “confusion” [2], or simply to implement priority of a
process over another one in the acquisition of a resource. The problem of
synchronization over immediate transitions was tackled in [?] for the case of all
immediates belonging to the same priority level, but the prototype
implementation presented in [?] is not available. The approach followed in [?] is to work
at the embedded discrete time Markov chain level, and to use a diagonal matrix
to remove the “anomalies” in the matrix due to the presence of priorities.

* This research is partially supported by CNR Short-term Mobility program and
Deutsche Forschungsgemeinschaft, SFB 559.

The contribution of this paper is to define a theoretical framework for the
Kronecker based solution of GSPNs that interact through transitions that are
either timed or immediate at different priority levels. We extend the result in [?],
provide a new proof for it, and describe Jacobi and Gauss-Seidel iteration schemes
for numerical analysis, which became part of the APNN Toolbox [3].

The paper is organized as follows: Section 2 introduces basic definitions.
Section 3 presents the extension of the Kronecker expression to the case of
synchronization over immediate transitions. Section 4 discusses algorithmic aspects,
i.e. how to generate a Kronecker expression for a SGSPN and how to perform
steady state analysis using Jacobi or Gauss-Seidel iteration schemes. Section 5
exercises an example and Section 6 concludes the paper.

2 Basic Definitions 

We assume that the reader is somewhat familiar with GSPNs and their dynamic
behavior [1], and we briefly recall definitions in order to fix the notation.

Definition 1. A GSPN is an eight-tuple $(P, T, a, I, O, H, W, M_0)$ where P is the
set of places, T is the set of transitions such that $T \cap P = \emptyset$, $a : T \to \mathbb{N}$ is the
priority function, $I, O, H : T \to Bag(P)$ are the input, output, and inhibition
functions, respectively, where $Bag(P)$ is the multiset on P, $W : T \times \mathbb{N}_0^{|P|} \to \mathbb{R}^+$
assigns a (possibly marking dependent) weight to each transition t with $a(t) \ge 1$
and a rate to each transition t with $a(t) = 0$, $M_0 : P \to \mathbb{N}_0$ is the initial marking:
a function that assigns a nonnegative integer value to each place.

$T_E = \{t \in T \mid a(t) = 0\}$ is the set of timed transitions, $T_I = T \setminus T_E$ is the set of
immediate transitions. A transition t such that $a(t) = k$ is said to be a “level k
transition”, and we can partition the set T accordingly: $T = \{T_0, T_1, \ldots, T_K\}$,
where $T_k = \{t \mid a(t) = k\}$. We denote by $^\bullet t$ ($t^\bullet$ and $^\circ t$) the set of input (output
and inhibitor) places of transition t. For a marking $M : P \to \mathbb{N}_0$ we also use
the vector notation (M also is a vector in $\mathbb{N}_0^{|P|}$). A transition t has concession
in marking M iff $M \ge I(t)$ and $\forall p \in {}^\circ t : M(p) < H(t)(p)$. A transition t
is enabled in marking M (denoted by $M[t\rangle$) if it has concession and $\neg\exists t'$
that has concession in M with $a(t') > a(t)$. The firing of t in M produces
the marking $M' = M - I(t) + O(t)$, and it is denoted as $M[t\rangle M'$. As a
consequence of the distinction between concession and enabling, all transitions
enabled in a given marking belong to the same priority level, and therefore the
set RS of reachable markings can be partitioned into K subsets
$RS_k = \{M \in RS \mid \exists t \in T_k : M[t\rangle\}$. $RS_0$ is usually named tangible reachability set;
states which enable only timed transitions are termed “tangible” and all other
states are “vanishing”. For GSPNs, well-known techniques apply to derive a
transition rate matrix R from the tangible reachability graph, such that the
underlying CTMC has a generator matrix $Q = R - D$ with diagonal matrix D,
where $D(i,j) = \sum_k R(i,k)$ if $i = j$ and 0 otherwise (D = rowsum(R)).
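The relation $Q = R - D$ with $D = \mathrm{rowsum}(R)$ can be made concrete with a small Python sketch. The function name `generator_from_rates` and the two-state rate matrix are illustrative assumptions, not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def generator_from_rates(R: Matrix) -> Matrix:
    """Return Q = R - D, where D is diagonal with D[i][i] = sum_k R[i][k]."""
    Q = [row[:] for row in R]          # copy R so the input stays untouched
    for i, row in enumerate(R):
        Q[i][i] -= sum(row)            # subtract the row sum on the diagonal
    return Q


if __name__ == "__main__":
    # Hypothetical two-state chain: rate 2 from state 0 to 1, rate 3 back.
    R = [[0.0, 2.0],
         [3.0, 0.0]]
    for row in generator_from_rates(R):
        print(row)
```

Every row of the resulting matrix sums to zero, which is the defining property of a CTMC generator.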

Superposed GSPNs [?] are GSPNs where, additionally, a partition of the
set of places is defined, such that SGSPNs can be seen as a set of GSPNs which
are synchronized by certain transitions. The following definition differs from the
classical one in [?] since it allows synchronization on transitions of whatever
priority level.

Definition 2. A SGSPN is a ten-tuple $(P, T, a, I, O, H, W, M_0, \Pi, TS)$ where
$(P, T, a, I, O, H, W, M_0)$ is a GSPN, $\Pi = \{P^0, \ldots, P^{N-1}\}$ is a partition of P
with index set $IS = \{0, \ldots, N-1\}$.

A SGSPN contains N components $(P^i, T^i, a^i, I^i, O^i, H^i, W^i, M_0^i)$ for $i \in IS$,
with $T^i = {}^\bullet(P^i) \cup (P^i)^\bullet \cup (P^i)^\circ$ and $a^i, I^i, O^i, H^i, M_0^i$ are functions $a, I, O, H, M_0$
restricted to $T^i$, resp. $P^i$. $IC(t) = \{i \in IS \mid t \in T^i\}$ is the set of involved
components for $t \in T$. $TS = \{t \in T \mid 1 < |IC(t)|\}$ is the set of synchronized transitions.
Furthermore we require marking dependent weight functions to be of the form
$W(t, M) = w_t \cdot \prod_{i \in IC(t)} W^i(t, M^i)$ with $w_t \in \mathbb{R}^+$ and $W^i : T \times \mathbb{N}_0^{|P^i|} \to \mathbb{R}^+$.

Note that $\Pi$ induces a partition of transitions on $T \setminus TS$ since for $t \in T \setminus TS$
there exists a unique $i \in IS$ : $^\bullet t \cup t^\bullet \cup {}^\circ t \subseteq P^i$. Consequently transitions in $T \setminus TS$ are
called local transitions. Marking dependencies of “multiplicative form,” as in the
definition, can be taken into account by the SGSPN solution algorithms, but,
for ease of notation, we shall assume that all rates and weights are assigned only
a constant value $w_t$. The case of SGSPN where synchronization transitions are
timed has been largely treated in the literature: the partition into components is
used to represent Q by a sum of Kronecker products defined on matrices which
result from isolated components. The definition of the Kronecker product $\otimes$
and sum $\oplus$ can be found in [14]. Let us just recall here that there are two
equivalent ways to refer to rows (or columns) of a Kronecker product/sum of N
matrices of dimension $n^i \times n^i$: as a tuple $(s^0, \ldots, s^{N-1})$, with $0 \le s^i < n^i$, or
as an integer s corresponding to $(s^0, \ldots, s^{N-1})$ in the mixed base representation
$(n^0, \ldots, n^{N-1})$.
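The two Kronecker operators and the two ways of addressing a row can be sketched in plain Python as follows. This is a minimal illustration using dense nested lists; the helper names (`kron`, `kron_sum`, `to_index`, `to_tuple`) are chosen here for illustration and are not taken from the paper or its implementation.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(A: Matrix, B: Matrix) -> Matrix:
    """Kronecker product: entry ((i1,i2),(j1,j2)) equals A[i1][j1] * B[i2][j2]."""
    rb, cb = len(B), len(B[0])
    return [[A[i // rb][j // cb] * B[i % rb][j % cb]
             for j in range(len(A[0]) * cb)]
            for i in range(len(A) * rb)]


def kron_sum(A: Matrix, B: Matrix) -> Matrix:
    """Kronecker sum of square matrices: A (+) B = A (x) I + I (x) B."""
    left = kron(A, identity(len(B)))
    right = kron(identity(len(A)), B)
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(left, right)]


def to_index(state: Sequence[int], sizes: Sequence[int]) -> int:
    """Map a tuple of local states to its integer in the mixed base `sizes`."""
    s = 0
    for s_i, n_i in zip(state, sizes):
        if not 0 <= s_i < n_i:
            raise ValueError("local state out of range")
        s = s * n_i + s_i
    return s


def to_tuple(s: int, sizes: Sequence[int]) -> Tuple[int, ...]:
    """Inverse of to_index: recover the tuple of local states."""
    state = []
    for n_i in reversed(sizes):
        state.append(s % n_i)
        s //= n_i
    return tuple(reversed(state))


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[0, 1], [1, 0]]
    K = kron(A, B)
    row, col = to_index((1, 0), (2, 2)), to_index((0, 1), (2, 2))
    print(K[row][col] == A[1][0] * B[0][1])
```

The last line checks that addressing the product by the integer index gives the same entry as multiplying the component entries selected by the tuple.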

Any component i of a SGSPN is a GSPN. It can be analyzed in isolation
yielding $RS^i$, assuming it is finite. Finiteness is an important restriction which
must be taken into account for a partition into components; P-invariants can
sometimes help [?]. Let all $RS^i$ be finite in the sequel, then the product state
space PS (also known as potential state space) is defined as $PS = \times_{i=0}^{N-1} RS^i$, resp.
$PS_0 = \times_{i=0}^{N-1} RS_0^i$ if only tangible states are considered. Synchronized transitions
may cause $|RS_0| \ll |PS_0|$ as observed in [?], an undesired effect which is
possible for PS as well. Each component in isolation has a generator matrix
$Q^i = R^i - D^i$. Matrix $R^i$ does not contain vanishing markings any more, i.e.,
the elimination of vanishing markings is applied separately at each component,
since the condition that synchronized transitions have to be timed makes the
behaviour of immediate transitions local to their own component, such that a
global normalization is not necessary [?]. Matrix $R^i$ can be seen as a sum of
matrices $R^i = \sum_t R^i_t$ such that nonzero entries are separated according
to the timed transition t which contributes to that entry. Matrices of
unsynchronized (local) timed transitions are summed up in $R^i_l = \sum_{t \in T^i \setminus TS} R^i_t$. Index l
will be used to denote “local” matrices summarizing local transitions and index
t to denote a specific synchronized transition.

Theorem 1. ([17]) Let G be a matrix of dimension $PS_0 = \times_{i=0}^{N-1} RS_0^i$, defined
as follows: $G = \bigoplus_{i=0}^{N-1} R^i_l + \sum_{t \in TS} w_t \bigotimes_{i=0}^{N-1} R^i_t$, where $R^i_t$ is the identity matrix
if $t \notin T^i$. The rate matrix R is the projection of G over $RS_0$ and it is impossible
to move from a reachable to an unreachable state: $(G)_{RS_0,RS_0} = R$ and
$(G)_{RS_0,PS_0 \setminus RS_0} = 0$.
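A composition of this kind can be illustrated with a toy Python sketch. It assumes the common form in which the local matrices are combined by a Kronecker sum and each synchronized transition contributes a weighted Kronecker product, with the identity matrix standing in for components that are not involved. The two components, their rates, and the function name `compose` are invented for illustration; dense nested lists are used only to keep the example short.

```python
from functools import reduce
from typing import Dict, List

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(A: Matrix, B: Matrix) -> Matrix:
    rb, cb = len(B), len(B[0])
    return [[A[i // rb][j // cb] * B[i % rb][j % cb]
             for j in range(len(A[0]) * cb)]
            for i in range(len(A) * rb)]


def add(A: Matrix, B: Matrix) -> Matrix:
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(A, B)]


def scale(c: float, A: Matrix) -> Matrix:
    return [[c * x for x in row] for row in A]


def compose(local: List[Matrix],
            sync: Dict[str, Dict[int, Matrix]],
            weights: Dict[str, float]) -> Matrix:
    """Build G from local matrices (Kronecker sum) and synchronized
    transitions (weighted Kronecker products)."""
    sizes = [len(R) for R in local]
    dim = 1
    for n in sizes:
        dim *= n
    G = [[0] * dim for _ in range(dim)]
    # Kronecker sum: one term per component, identity everywhere else.
    for i, R in enumerate(local):
        factors = [identity(n) for n in sizes]
        factors[i] = R
        G = add(G, reduce(kron, factors))
    # Synchronized transitions: identity for components not involved in t.
    for t, parts in sync.items():
        factors = [parts.get(i, identity(n)) for i, n in enumerate(sizes)]
        G = add(G, scale(weights[t], reduce(kron, factors)))
    return G


if __name__ == "__main__":
    # Hypothetical example: two components with two local states each.
    local = [[[0, 1], [0, 0]],      # component 0: local move 0 -> 1 with rate 1
             [[0, 2], [0, 0]]]      # component 1: local move 0 -> 1 with rate 2
    sync = {"t": {0: [[0, 0], [1, 0]],   # t resets component 0 from 1 to 0 ...
                  1: [[0, 0], [1, 0]]}}  # ... and component 1 at the same time
    for row in compose(local, sync, {"t": 3}):
        print(row)
```

In the example the synchronized transition produces a single entry, from the joint state (1,1) back to (0,0), because it can only fire when both components are in local state 1.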

The corresponding diagonal matrix D has a structured representation as well, 
but in our implementation it is pre-computed and kept in memory. 

Despite the fact that $RS_0 \subseteq PS$, the advantageous aspect of the Kronecker
based approach is that the Kronecker expression can be directly used by
appropriate solution algorithms [5,19], without the need of storing the infinitesimal
generator; moreover data structures for a Kronecker matrix vector multiplication
can be of size $|RS_0|$, instead of $|PS|$, according to [11,17]. Finally, let us
recall that performance analysis of GSPNs usually takes place by considering the
associated semi-Markov process which is either transformed a) to an embedded
discrete time Markov chain (DTMC) with a $(RS \times RS)$ matrix of transition
probabilities P or b) to a reduced continuous time Markov chain (CTMC), by
an elimination of vanishing states, with a $(RS_0 \times RS_0)$ generator matrix Q. A
solution of the DTMC requires computation of $\pi P = \pi$, while the solution of
the CTMC requires computing $\pi$ with $\pi Q = 0$ and $\pi \mathbf{1} = 1$. The latter is
recommended due to its smaller matrix dimension.



3 Kronecker Expression with Prioritized Synchronization 

An extension towards synchronization over immediate transitions faces two obstacles: a) it is necessary to keep (at least certain) vanishing states, with the disadvantages explained above, so that the solution of the GSPN amounts to the solution of the embedded DTMC, and b) another, more serious, difficulty arises from the fact that priorities in GSPNs are global, and this may have unexpected consequences. Assume marking $(M^1, M^2)$ is reachable in an SGSPN with two components, and transition t (t') of component 1 (2) has concession and is locally enabled in the local marking $M^1$ ($M^2$), but t' has higher priority than t: therefore t has concession, but it is not enabled in marking $(M^1, M^2)$. We can say that concession is "compositional," but enabling is "global."

Let us consider an SGSPN S consisting of N components with indices 1, ..., N, and K + 1 priority levels {0, 1, ..., K}. For each component $\mathcal{N}^i$ we can define a GSPN $\mathcal{N}'^i$ with modified priorities, in which all transitions have priority zero. As explained in [1], chapter 2, by ignoring priorities we produce a state space which is a superset of the original one. Observe that if the original system is finite, by neglecting priorities we can get an infinite state space, but this can happen only in systems that are not covered by P-invariants. Let us define on
these modified components matrices $R_t^i$ (which is equivalent to defining them in the original system using concession instead of enabling) as: $R_t^i(s^i, s'^i) = 1$ iff $s^i$ can reach $s'^i$ by firing t; let also $R_{l,k}^i = \sum_{t \in T_k^i \setminus TS} w_t R_t^i$, where $T_k^i$ denotes the transitions of component i with priority k. Under this hypothesis

$$P' = D^{-1}\Bigl(\sum_{t \in TS} w_t \bigotimes_{i=1}^{N} R_t^i + \bigoplus_{i=1}^{N} \sum_{k=0}^{K} R_{l,k}^i\Bigr)$$

is a correct expression for the transition probability matrix of the modified SGSPN S'. We want to use matrices generated from the modified components $\mathcal{N}'^i$ to define a Kronecker expression for the original SGSPN S. Since the state space of S can be partitioned into K + 1 disjoint subsets $RS_0, \ldots, RS_K$, according to the K + 1 priority levels, we can "slice" matrix P into K + 1 matrices $X_k$, each of size $|PS| \times |PS|$. Each matrix $X_k$ will define the stripe of matrix P corresponding to states in $RS_k$. Note that the local reachability sets are generated using the modified components, but the "slicing" is based on the priorities of the original net. We can then say that:

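Proposition 1. Given an SGSPN S and matrices $R_t^i$ and $R_{l,k}^i$ defined on the modified components as above, the transition probability matrix of S is expressed by $P = D^{-1}\bigl(\sum_{k=0}^{K} X_k\bigr)$, where matrices $X_k$ are inductively defined as follows:

$$X_K = \sum_{t \in TS_K} w_t \bigotimes_{i=1}^{N} R_t^i + \bigoplus_{i=1}^{N} R_{l,K}^i$$

$$X_k = \Bigl(I - \sum_{j=k+1}^{K} \delta(X_j)\Bigr)\Bigl(\sum_{t \in TS_k} w_t \bigotimes_{i=1}^{N} R_t^i + \bigoplus_{i=1}^{N} R_{l,k}^i\Bigr)$$

Here $\delta(X)$ denotes the diagonal 0/1 matrix whose entry $(x,x)$ is 1 iff row x of X contains a nonzero entry.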

Proof: we start by showing that the formula for P correctly computes the stripe for states in $RS_K$. Observe that in the formula for $X_K$ only transitions of maximal priority are considered; if $t \in T_K$ is enabled in a state x, then transitions of lower priority can have concession in x, but they are not enabled. Since the formula for $X_K$ is exactly that used for an SGSPN where all transitions have the same priority, we can conclude that the stripe for level K is correctly computed. Assuming that the stripe for $RS_K$ is correctly defined by the formula above, we need to prove that the same holds for other values of k. Let us start with K - 1, and consider the meaning of $S = (I - \delta(X_K))$, where S stands for selection: indeed $S(x,x) = 1$ implies that no transition of level K can fire in x, while $S(x,x) = 0$ implies that a level K transition can fire in x. If we consider the expression for $X_{K-1}$, we can write it as $X_{K-1} = S \cdot C$ where $C = \sum_{t \in TS_{K-1}} w_t \bigotimes_{i=1}^{N} R_t^i + \bigoplus_{i=1}^{N} R_{l,K-1}^i$. Observe that C correctly describes the contribution of transitions $t \in T_{K-1}$ in the SGSPN without priority. By premultiplying it by S we get a row equal to zero for all states x for which $S(x,x) = 0$, that is to say for all states in which a transition of level K is enabled, and we get instead row $C(x)$ for all states in which no transition of level K is enabled. Therefore S selects, among all rows of C, the contribution of transitions of $T_{K-1}$ for those states in which such transitions have concession and no transitions of priority higher than K - 1 are enabled. The proof can then be easily carried on by induction for a generic index k. □

The idea of using a matrix to "mask" the effect of transitions that have concession, but that are not enabled, is taken from m-; we have here extended it to the case of multiple priority levels and we have provided a different proof.
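The masking idea can be illustrated on a small flat (non-Kronecker) example. The sketch below assumes one dense "concession" matrix per priority level and reads the mask as "zero out every row in which some higher-priority stripe already has a nonzero entry"; the matrices and the helper names `delta`, `stripes` and `transition_matrix` are hypothetical.

```python
def delta(x):
    """Diagonal of delta(X): 1 for each row of X with a nonzero entry, else 0."""
    return [1 if any(v != 0 for v in row) else 0 for row in x]


def stripes(concession):
    """concession[k] holds the entries of priority-k transitions, built from
    concession only. Returns X_0..X_K with rows masked by higher priorities."""
    top = len(concession) - 1
    blocked = [0] * len(concession[0])   # accumulates sum_{j>k} delta(X_j)
    xs = [None] * (top + 1)
    for k in range(top, -1, -1):         # highest priority first
        xs[k] = [[0 if blocked[r] else v for v in row]
                 for r, row in enumerate(concession[k])]
        blocked = [b or d for b, d in zip(blocked, delta(xs[k]))]
    return xs


def transition_matrix(concession):
    """P = D^{-1} * sum_k X_k, assuming every state has an enabled transition."""
    xs = stripes(concession)
    n = len(xs[0])
    total = [[sum(x[r][c] for x in xs) for c in range(n)] for r in range(n)]
    return [[v / sum(row) for v in row] for row in total]


# State 0: a priority-1 transition (0 -> 1) and a priority-0 one (0 -> 2)
# both have concession; only the priority-1 transition is enabled.
C1 = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
C0 = [[0, 0, 4], [0, 0, 2], [3, 0, 0]]
X0, X1 = stripes([C0, C1])
P = transition_matrix([C0, C1])
```

In this toy model the entry for the low-priority move out of state 0 disappears from $X_0$, while the rows of states 1 and 2 are left untouched.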



4 Analysis Algorithms 

Applying Proposition 1 for analysis requires computing $R_t^i$ and $R_{l,k}^i$, $(I - \sum_{j=k+1}^{K} \delta(X_j))$, and $D^{-1}$ for all transitions, priorities and components. We follow the approach proposed in, e.g., [9[T1T8] for classical SGSPNs with certain adjustments. It includes three steps: a) generation of matrices $R_t^i$ for each component i in isolation, b) a state space exploration to obtain the set of reachable states, and c) an iterative numerical method, either for steady state analysis using e.g. Jacobi or Gauss-Seidel type iterations, or for transient analysis using randomization. Step b) is optional but recommended to allow solution algorithms to work with data structures of size $|RS|$ instead of $|PS|$.

Generation of matrices $R_t^i$, $R_{l,k}^i$. The general principle is to perform a state space exploration for each component i in isolation with a slightly modified enabling rule to care for the effect of synchronizing transitions.

Enabling rule: A transition $t \in T^i$ is enabled in a marking $x = M^i$ iff the conjunction of the following conditions is fulfilled:

1) $M^i(p) \geq I(t)(p)$ for all $p \in P^i$

2) $M^i(p) < H(t)(p)$ for all $p \in P^i$ if $H(t)(p) \neq 0$

3) no $t' \in T^i \setminus TS$ with $\alpha(t') > \alpha(t)$ is enabled.

Conditions 1 and 2 state that t must have concession at $M^i$ with respect to component i. This directly corresponds to the argumentation of Section 3. Condition 3 is a necessary condition for enabling transition t, since if a local transition t' with priority $\alpha(t') > \alpha(t)$ is locally enabled, then in any corresponding global state after composition at least one transition with a priority larger than $\alpha(t)$ is enabled. This improves the construction of Section 3, where we focused on clarity; at this point we want to prevent $RS^i$ from being unnecessarily large. The enabling rule above can become even more strict if one observes upper limits provided by P-invariants [HI, since computation and validity of P-invariants do not change in the presence of priorities.
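The three conditions of the local enabling rule can be sketched as a small predicate. This is an illustrative reading only: markings and arc functions are plain dictionaries, `local_ts` stands for the unsynchronized transitions of the component, and all names (`has_concession`, `locally_enabled`, the toy net) are assumptions of this example.

```python
def has_concession(m, t, inp, inh):
    """Conditions 1 and 2: input arcs satisfied, inhibitor arcs not violated."""
    if any(m.get(p, 0) < need for p, need in inp.get(t, {}).items()):
        return False
    if any(bound != 0 and m.get(p, 0) >= bound
           for p, bound in inh.get(t, {}).items()):
        return False
    return True


def locally_enabled(m, t, inp, inh, prio, local_ts):
    """Concession plus condition 3: no enabled local transition of higher
    priority. The recursion ends because priorities strictly increase."""
    if not has_concession(m, t, inp, inh):
        return False
    return not any(
        prio[u] > prio[t] and locally_enabled(m, u, inp, inh, prio, local_ts)
        for u in local_ts
    )


# Hypothetical component with one place and three local transitions.
inp = {"t_low": {"p1": 1}, "t_high": {"p1": 1}, "t_top": {"p1": 1}}
inh = {"t_top": {"p1": 1}}          # t_top is inhibited while p1 holds a token
prio = {"t_low": 0, "t_high": 1, "t_top": 2}
local_ts = ["t_low", "t_high", "t_top"]
marking = {"p1": 1}
```

With this marking `t_top` lacks concession, `t_high` is enabled, and `t_low` has concession but is masked by `t_high`.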

From the reachability graph of each component, we obtain matrices $R_t^i(M^i, M'^i) = 1$ if $M^i [t\rangle M'^i$ and 0 otherwise, since we allow only for fixed rates/weights^1 $w_t$. However, the resulting $RS^i$ is a superset of the projection of RS on $P^i$, since the enabling condition for the local state space exploration is weaker than the original enabling rule for the complete SGSPN.

Exploring RS and generating $RS_0, \ldots, RS_K$. Since PS is usually a large superset of RS, it is often useful to restrict analysis algorithms to RS as e.g. in [17]. In order to identify RS we perform a state space exploration using the Kronecker

^1 In case of marking-dependent rates/weights, the entry becomes $R_t^i(M^i, M'^i) = W(M^i, t)$.
[Figure 1: decision diagram with one level per component (component 1, component 2, component 3)]

Fig. 1. Representation of $RS_0$, $RS_1$ by a decision diagram

representation as in PS], but we apply an enabling rule which considers priorities.

Enabling rule: Transition $t \in T$ is enabled at marking $M = (M^1, \ldots, M^N)$ iff the conjunction of the following conditions holds: 1) t is enabled in all components $i \in IC(t)$ at $M^i$, i.e. an entry $R_t^i(M^i, \cdot) \neq 0$ exists, and 2) no $t' \in T$ with $\alpha(t') > \alpha(t)$ is enabled.

Firing of an enabled transition t at $(M^1, \ldots, M^N)$ gives $(M'^1, \ldots, M'^N)$ using column indices of $R_t^i(M^i, \cdot)$. Condition 2 is formalized in Proposition 1 by matrices $(I - \sum_{j=k+1}^{K} \delta(X_j))$; however, for an algorithmic treatment it is more straightforward to order transitions by priorities and to consider their enabling by starting with transitions of highest priority first. The first enabled transition determines the priority level for the current state, so that there is no need in the implementation to build matrices $(I - \sum_{j=k+1}^{K} \delta(X_j))$ for this step. The resulting set RS is represented by a decision diagram as in m], Section 3.1, similar to the directed acyclic graph of [6], extended with full vectors per node. This structure encodes a reachable state by a path of length N - 1 visiting nodes $(M^1, \ldots, M^N)$, and its position in an index set $\{0, 1, \ldots, |RS| - 1\}$ by summation of "offsets" along this path. We additionally partition RS into sets $RS_0, \ldots, RS_K$ using a decision diagram (DD) again, which contains K + 1 root nodes to access each set. This is useful to observe $(I - \sum_{j=k+1}^{K} \delta(X_j))$ in the subsequent analysis, since instead of "masking" matrix entries, one can recognize any state that belongs to "stripe" k by its membership in $RS_k$ as well. Fig. 1 shows an example DD with $RS_0$ and $RS_1$ for a model with 3 components. The path along the dotted line encodes 3 states due to 3 states in the terminal node; state $(1,2,3) \in RS_1$ is highlighted and its unique position in $\{0, 1, \ldots, |RS| - 1\}$ is 6 = 3 + 1 + 2. The full vectors are used to determine in O(1) whether a state exists in a node and its position in the state vector of the node, see EH for details. This data structure is related to binary and especially multi-valued decision diagrams (MDD) [16]; however, it differs from an MDD, which contains less information per node and has an additional level for terminal nodes. Nevertheless, MDDs and DDs share the key idea of all decision diagrams: isomorphic substructures are represented only once. We will use DDs for two purposes: a) to recognize states being element of a set, and b) to translate row/column indices of the Kronecker representation with range $\times_{i=1}^{N} RS^i$ uniquely to an index set $\{0, 1, \ldots, |RS| - 1\}$ and back. The latter is used e.g. in numerical analysis to address the position in an iteration vector $\pi$ of length $|RS|$.
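The translation from a state tuple to its position by "summation of offsets along a path" can be sketched with a much simpler structure than the DD of the paper. The code below is an assumption-laden illustration: it builds a plain multi-level trie over the sorted reachable states, uses dictionaries instead of full vectors, and does not share isomorphic substructures; `build` and `index_of` are hypothetical names.

```python
def build(states):
    """Build a multi-level index over distinct state tuples of equal length.
    Each node maps a local state to (offset, child), where offset counts the
    reachable states ordered before the first state passing through that entry."""
    ordered = sorted(set(states))

    def make(group, depth):
        node, offset, i = {}, 0, 0
        while i < len(group):
            key = group[i][depth]
            j = i
            while j < len(group) and group[j][depth] == key:
                j += 1
            last_level = depth + 1 == len(group[i])
            child = None if last_level else make(group[i:j], depth + 1)
            node[key] = (offset, child)
            offset += j - i          # number of states below this entry
            i = j
        return node

    return make(ordered, 0)


def index_of(root, state):
    """Sum the offsets along the path; None if the state is not reachable."""
    node, position = root, 0
    for s in state:
        if node is None or s not in node:
            return None
        offset, node = node[s]
        position += offset
    return position


reachable = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 2, 3), (2, 0, 0)]
dd = build(reachable)
```

The same lookup answers both questions raised in the text: membership (a `None` result means the state is not in the set) and the position of the state in an iteration vector.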

Numerical solution of the associated Markov chain. We consider iterative methods of Jacobi and Gauss-Seidel type to solve $\pi P = \pi$. An iteration scheme according to the method of Jacobi for a stochastic matrix P turns out to be equivalent to the power method, which gives $\pi' = \pi P$. A Gauss-Seidel iteration considers a matrix splitting $P = B + U + L$ into a lower (upper) triangular matrix L (U) and a diagonal matrix B and performs $\pi' = \pi L B^{-1} + \pi' U B^{-1}$, where $B = I = B^{-1}$ for a stochastic matrix (if we require that any transition fulfils $O - I \neq 0$), such that Gauss-Seidel simplifies to $\pi' = \pi L + \pi' U$. Applied to the Kronecker expression of Proposition 1 we get for Jacobi:

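$$\pi' = \pi D^{-1}\Bigl(\sum_{k=0}^{K} X_k\Bigr) = \pi D^{-1} \sum_{k=0}^{K} \Bigl(I - \sum_{j=k+1}^{K} \delta(X_j)\Bigr)\Bigl(\sum_{t \in TS_k} w_t \bigotimes_{i=1}^{N} R_t^i + \bigoplus_{i=1}^{N} R_{l,k}^i\Bigr)$$

Note that $D^{-1}$ and $(I - \sum_{j=k+1}^{K} \delta(X_j))$ are diagonal matrices which commute; furthermore, for an implementation all matrices $(I - \sum_{j=k+1}^{K} \delta(X_j))$ for k = 0, ..., K can be represented by a single vector of length $|RS|$, since nonzero diagonal entries in $(I - \sum_{j=k+1}^{K} \delta(X_j))$ restricted to RS only belong to $RS_k$ and sets $RS_0, \ldots, RS_K$ are disjoint.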

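Analogously we get for Gauss-Seidel $\pi' = \pi L + \pi' U$, where L and U are the lower and upper triangular parts of

$$P = D^{-1} \sum_{k=0}^{K} \Bigl(I - \sum_{j=k+1}^{K} \delta(X_j)\Bigr)\Bigl(\sum_{t \in TS_k} w_t \bigotimes_{i=1}^{N} R_t^i + \bigoplus_{i=1}^{N} R_{l,k}^i\Bigr).$$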

A Jacobi and a Gauss-Seidel iteration differ by the Kronecker matrix vector multiplication algorithms which can be applied [^: Jacobi allows for a multiplication by rows, while Gauss-Seidel requires a multiplication by columns. In the presence of priorities a key issue for performance is the treatment of $(I - \sum_{j=k+1}^{K} \delta(X_j))$ in combination with the restriction to RS.
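The difference between the two schemes is easiest to see on an explicit matrix. The following sketch solves $\pi P = \pi$ for a small dense stochastic matrix; it is not the Kronecker-based algorithm of the paper, the example matrix is made up, and the step functions assume $P(j,j) < 1$ for every state j.

```python
def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def jacobi_step(pi, P):
    """Power-method step pi' = pi P: every entry uses only old values."""
    n = len(pi)
    return normalize([sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)])


def gauss_seidel_step(pi, P):
    """Column-wise sweep: entry j already sees the updated entries before it."""
    n = len(pi)
    new = list(pi)
    for j in range(n):
        acc = sum(new[i] * P[i][j] for i in range(n) if i != j)
        new[j] = acc / (1.0 - P[j][j])   # assumes P[j][j] < 1
    return normalize(new)


def iterate(step, P, sweeps=200):
    pi = [1.0 / len(P)] * len(P)         # uniform start vector
    for _ in range(sweeps):
        pi = step(pi, P)
    return pi


P = [[0.0, 0.5, 0.5],
     [0.25, 0.0, 0.75],
     [0.6, 0.4, 0.0]]
pi_jacobi = iterate(jacobi_step, P)
pi_gs = iterate(gauss_seidel_step, P)
```

Both runs approach the same stationary vector; the Gauss-Seidel sweep needs column access to P because it completes one entry of the new vector at a time.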

Jacobi iteration. Fig. 2 describes a single iteration step for Jacobi in pseudo code. Diagonal matrix $D^{-1}$ is represented by a single vector of length $|RS|$ since unreachable states need no normalization. Iteration vectors $\pi$, $\pi'$ have length $|RS|$ as well. We use K + 2 DDs, one for RS denoted by $DD()$, and one for each $RS_0, \ldots, RS_K$ denoted as $DD_0(), \ldots, DD_K()$, to translate between vector positions in $\pi$, $\pi'$, and vector $D^{-1}$ and the row/column indices of Kronecker matrices $R_t^i$, $R_{l,k}^i$. Clearly, an implementation keeps all these DDs in a single DD with K + 2 root nodes, in order to maximize sharing of substructures. The notation $DD_k(s_1, \ldots, s_j)$ identifies a DD node for component j + 1 in a path along states $s_1, \ldots, s_j$ starting at the root node for $RS_k$.

The basic ideas of the algorithm are 1) to premultiply $\pi$ with $D^{-1}$, 2) to partition states according to priorities into DDs and to consider sets of states in an order according to priorities, such that for each state in this set (represented
One iteration of Jacobi type

for s ∈ RS : π(s) = π(s) · D^{-1}(s) ;
for each k ∈ {0, ..., K} and for each t ∈ TS_k do
  for each s_1 ∈ DD_k() do
    for each R_t^1(s_1, s'_1) ≠ 0 do
      i_1 = off(DD_k(), s_1) ; j_1 = off(DD(), s'_1) ;
      x_1 = R_t^1(s_1, s'_1) · w_t ;
      for each s_2 ∈ DD_k(s_1) do
        for each R_t^2(s_2, s'_2) ≠ 0 do
          i_2 = i_1 + off(DD_k(s_1), s_2) ; j_2 = j_1 + off(DD(s'_1), s'_2) ;
          x_2 = x_1 · R_t^2(s_2, s'_2) ;
          for each s_3 ∈ DD_k(s_1, s_2) do
            ...
            for each s_N ∈ DD_k(s_1, s_2, ..., s_{N-1}) do
              for each R_t^N(s_N, s'_N) ≠ 0 do
                i_N = i_{N-1} + off(DD_k(s_1, s_2, ..., s_{N-1}), s_N) ;
                j_N = j_{N-1} + off(DD(s'_1, s'_2, ..., s'_{N-1}), s'_N) ;
                π'(j_N) = π'(j_N) + π(i_N) · x_{N-1} · R_t^N(s_N, s'_N) ;
for each k ∈ {0, ..., K} and for each R_{l,k}^i in a Kronecker sum for priority k
  for each s_1 ∈ DD_k() do
    i_1 = off(DD_k(), s_1) ; j_1 = off(DD(), s_1) ;
    for each s_2 ∈ DD_k(s_1) do
      ...
      for each s_i ∈ DD_k(s_1, s_2, ..., s_{i-1}) do
        for each R_{l,k}^i(s_i, s'_i) ≠ 0 do
          i_i = i_{i-1} + off(DD_k(s_1, s_2, ..., s_{i-1}), s_i) ;
          j_i = j_{i-1} + off(DD(s_1, s_2, ..., s_{i-1}), s'_i) ;
          for each s_{i+1} ∈ DD_k(s_1, s_2, ..., s_i) do
            i_{i+1} = i_i + off(DD_k(s_1, s_2, ..., s_i), s_{i+1}) ;
            j_{i+1} = j_i + off(DD(s_1, s_2, ..., s_{i-1}, s'_i), s_{i+1}) ;
            ...
            for each s_N ∈ DD_k(s_1, s_2, ..., s_{N-1}) do
              i_N = i_{N-1} + off(DD_k(s_1, s_2, ..., s_{N-1}), s_N) ;
              j_N = j_{N-1} + off(DD(s_1, s_2, ..., s_{i-1}, s'_i, s_{i+1}, ..., s_{N-1}), s_N) ;
              π'(j_N) = π'(j_N) + π(i_N) · R_{l,k}^i(s_i, s'_i) ;
normalize π' ; ε = ||(π / D^{-1}) - π'|| ; π = π' ; π' = 0 ;

Fig. 2. Jacobi iteration with matrix vector multiplication by rows

by $DD_k$) only the Kronecker representation for transitions of the appropriate priority is selected. This is an algorithmic reflection of the matrix formalization $(I - \sum_{j=k+1}^{K} \delta(X_j))$. Furthermore, 3) the index j of the column state s' is obtained from the $DD()$ of RS since it can belong to any DD of $DD_0(), \ldots, DD_K()$. 4) The matrix vector multiplication follows the one recognized as most efficient for Jacobi type iteration in [^. Note that matrices $R_{l,k}^i$ summarize firings of possibly many local transitions, such that several entries per row can appear frequently. 5) Function off(node, state) gives the offset for a state by an O(1) index determination from the node's full vector. Note that one has to follow each level, even in case of identity matrices, to obtain all indices i and j and to consider all relevant matrix entries. The treatment of a Kronecker sum as a sum of Kronecker products clearly illustrates this point. Reachability of states and states which enable t are mutually checked at each level, such that a failure of one condition can be quickly recognized and precomputed parts of $i_1, i_2, \ldots$, $j_1, j_2, \ldots$ and $x_1, x_2, \ldots$ are subject to reuse. Matrices of the Kronecker representation are accessed by rows.






One iteration of Gauss-Seidel type

for each s' = (s'_1, s'_2, ..., s'_N) ∈ RS with index j ∈ {0, ..., |RS| - 1} do
  new = 0 ;
  for each k ∈ {0, ..., K} do
    for each t ∈ TS_k do
      for each R_t^1(s_1, s'_1) ≠ 0 with s_1 ∈ DD_k() do
        i_1 = off(DD_k(), s_1) ; x_1 = R_t^1(s_1, s'_1) · w_t ;
        for each R_t^2(s_2, s'_2) ≠ 0 with s_2 ∈ DD_k(s_1) do
          i_2 = i_1 + off(DD_k(s_1), s_2) ; x_2 = x_1 · R_t^2(s_2, s'_2) ;
          for each R_t^3(s_3, s'_3) ≠ 0 with s_3 ∈ DD_k(s_1, s_2) do
            ...
            for each R_t^N(s_N, s'_N) ≠ 0 with s_N ∈ DD_k(s_1, s_2, ..., s_{N-1}) do
              i_N = i_{N-1} + off(DD_k(s_1, s_2, ..., s_{N-1}), s_N) ;
              new = new + π(i_N) · x_{N-1} · R_t^N(s_N, s'_N) ;
    for each R_{l,k}^i in a Kronecker sum for priority k
      i_1 = off(DD_k(), s'_1) ;
      i_2 = i_1 + off(DD_k(s'_1), s'_2) ;
      ...
      for each R_{l,k}^i(s_i, s'_i) ≠ 0 with s_i ∈ DD_k(s'_1, s'_2, ..., s'_{i-1}) do
        i_i = i_{i-1} + off(DD_k(s'_1, s'_2, ..., s'_{i-1}), s_i) ;
        i_{i+1} = i_i + off(DD_k(s'_1, s'_2, ..., s'_{i-1}, s_i), s'_{i+1}) ;
        ...
        i_N = i_{N-1} + off(DD_k(s'_1, s'_2, ..., s'_{i-1}, s_i, s'_{i+1}, ..., s'_{N-1}), s'_N) ;
        new = new + π(i_N) · R_{l,k}^i(s_i, s'_i) ;
  ε = max(ε, |π(j) / D^{-1}(s') - new|) ;
  π(j) = new · D^{-1}(s') ;

Fig. 3. Gauss-Seidel iteration with matrix vector multiplication by columns



A distinct feature of the Jacobi iteration above is that vector $\pi'$ accumulates values in an order which does not allow finished entries to be distinguished from incomplete ones during computation. This makes it useless for a Gauss-Seidel type iteration, which wants to profit from newer, completely calculated values in $\pi'$ instead of using current values of $\pi$.

Gauss-Seidel iteration. Fig. 3 describes a Gauss-Seidel iteration step. It computes values of $\pi'$ one after the other. Column indices are indicated by $s' = (s'_1, s'_2, \ldots, s'_N)$ and the index variable by j. Variable new is an accumulator for the value of $\pi'(j)$, i.e. $\pi'(s')$. Since a state s' can be reached by firing a transition of any priority level, the algorithm needs to consider all transitions. If a transition t with priority k may cause a state transition (s, s'), it is still necessary to check that t is enabled in s, i.e. to check whether $s \in RS_k$. Hence the algorithm considers both conditions at each component and can detect a failure before reaching component N. Gauss-Seidel makes less use of precomputed results than Jacobi since it keeps the $s'_i$ values fixed, while Jacobi considers ranges of $s_i$ and $s'_i$ at each level. Matrices $R_t^i$, $R_{l,k}^i$ are accessed by columns. The influence of $D^{-1}$ is reflected in the last line. The impact of matrices $(I - \sum_{j=k+1}^{K} \delta(X_j))$ is implicit in the algorithm by selecting only states $s_i$ in $DD_k()$. Compared to previous work EH] we use additional DDs for each priority in order to be able to focus on states which enable a transition of a certain level of priority k. The Gauss-Seidel approach can be further enhanced by matrix diagrams, an approach which is orthogonal and which has not been implemented yet. In summary, a numerical analysis of Jacobi or Gauss-Seidel type can be performed by repeatedly applying the algorithm of Figure 2 or 3, respectively, starting from an arbitrary initial distribution over RS, which e.g. can be chosen as $\pi_0(M) = 1$ if $M = M_0$ and 0 otherwise. $\varepsilon$ is used to compute $\max_i |\pi'(i) - \pi(i)|$ as a simple but common criterion to detect convergence if $\varepsilon$ falls below a given threshold; alternatively one might compute residuals by one additional matrix vector multiplication.



5 Analysis of an Example and Tools Integration 



Models of software with hardware resource acquisition motivated this work, hence we selected an example of this kind from the literature [4]. The workload consists of 4 types of tasks called A, B, C and D. The first two run on a first cpu, and the remaining two on a second cpu. Communication can take place on the same cpu (requiring cpu and memory), between the two cpus (through a dual access memory), or with the external environment (through one of the cpu links). Priorities were used to indicate priority of synchronization among tasks over resource acquisition, and two levels of priorities for immediate transitions made it possible to avoid confusion. With respect to the original model we have here experimented with the case in which the external environment is always ready to communicate with the tasks, and for this communication there is no need to acquire external resources. This resulted in a modified behaviour of the software, since the branch corresponding to a communication with the external environment can always be taken, and this choice has priority 2. To analyze the model we define a partition. Tasks A and B (two each), the first cpu and its associated memory are collected into a single component named Sw1; a similar choice for the other task types and cpu leads to component Sw2; the dual memory (which is shared by the two cpus) is considered as a separate component named Dm. The state space generation of the components yields 15 states for Dm, 9087 for Sw2, and 9061 for Sw1. The cross product has dimension of roughly 1,235,060,000 while $|RS| = 2,469,626$. The DD for RS requires 4 nodes (210,932 bytes), the DD for $RS_0, \ldots, RS_K$ uses 11 nodes (459,816 bytes). For K = 3 priorities, we have 12 Kronecker products and 4 Kronecker sums which use altogether 4,781,332 bytes. An iteration vector requires 19,757,008 bytes. Experiments were performed on a Sun Enterprise 250 with a 400 MHz CPU and 2 GB main memory, running SunOS 5.7. We observed 13.7 s for a Jacobi iteration step and 95.7 s for a Gauss-Seidel one. However, checking liveness of the net indicates that 26 transitions are dead although there is only one strongly connected component, of size 2,469,625. This suggested that priorities need to be modified if the external environment is not present, as the presence of a choice of priority 2 in a conflict with a single input place has killed part of the model behaviour. However, due to lack of space we only give results for a Kronecker representation where dead transitions are removed. Computation times drop to 4.5 s for a Jacobi step and 23.8 s for a Gauss-Seidel step for the same(!) DTMC. If we perform an elimination of vanishing states and solve the embedded CTMC of dimension ($|RS_0| \times |RS_0|$) with $|RS_0| = 737,881$ by Gauss-Seidel using the classical sparse matrix (4,253,738 nonzero entries), a single iteration step takes 1.1 s.
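The size figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes 8 bytes per vector entry (double precision), which is consistent with the stated vector size but is not spelled out in the text; the variable names are chosen for this example.

```python
component_states = {"Dm": 15, "Sw2": 9087, "Sw1": 9061}

# Size of the product state space PS: product of the component state spaces.
ps_size = 1
for n in component_states.values():
    ps_size *= n

rs_size = 2_469_626          # reachable states reported in the text
bytes_per_entry = 8          # assumption: one double per state
vector_bytes = rs_size * bytes_per_entry
reachable_fraction = rs_size / ps_size

print(f"|PS| = {ps_size:,}")
print(f"iteration vector = {vector_bytes:,} bytes")
print(f"|RS| / |PS| = {reachable_fraction:.4%}")
```

Only about 0.2% of the product space is reachable here, which is why restricting vectors to RS through the DD matters for memory consumption.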
Modelling and analysis of this example employed the GreatSPN [8] package to draw the net and to graphically specify components as layers, conversion software to the APNN format for SGSPNs, and several analysis tools of the APNN Toolbox [3], which have been extended for our approach to Kronecker representations with priorities. At this stage, apart from state space exploration and numerical analysis, functional analysis is also supported, e.g. to check liveness of a net. These tools can be employed in combination with GreatSPN, which is used to define the nets, to provide a number of structural analysis checks, and to compare results with the GreatSPN solvers and simulator.

6 Conclusions 

We presented a Kronecker based solution for SGSPNs where multiple priority levels are allowed, and synchronizing transitions can be immediate. This significantly enlarges the class of real application models that can be treated with the Kronecker approach.

The numerical solution is based on the DTMC, instead of the CTMC, with the disadvantage of being forced to keep all vanishing states. It is future work to define and implement vanishing marking elimination at least for subnets of immediate transitions that are not involved in synchronization among components. The structure "by stripes" of the expression may lead to very sparse matrices, and it may be worthwhile to group components to try to reduce the number of matrices, or to reduce the number of stripes by eliminating certain priority levels, applying the ideas in |Zj. Due to lack of space we did not consider the complexity of the given algorithms for Kronecker matrix vector multiplications and only state that the impact of priorities is that one needs to take care of the existence of a matrix entry (concession of the corresponding transition) plus its validity (enabling according to priorities). For Jacobi iterations, this has a minor effect due to a partitioning of states according to priorities; however, for Gauss-Seidel iterations one needs to check both conditions during iterations.

This work also represents an additional step in the line of integration of tools developed at different universities, in this case the APNN Toolbox of the Dortmund group and GreatSPN of the Torino one.
Abstract. Stochastic Petri Net Package (SPNP) is a software package whose goal is to compute performance, availability or performability measures from Stochastic Petri Nets (SPN) and Fluid Stochastic Petri Nets (FSPN). This software can use either analytic numeric methods or simulation methods. Unfortunately, standard discrete event simulation is inefficient for estimating the probabilities of rare events. For such rare event simulations, the importance splitting technique is a good method to speed up the simulation. In the literature, two different importance splitting techniques are known: RESTART and splitting. In this paper, we describe the application of these methods to (both fluid and discrete) Petri nets and their implementation in SPNP, and we give some illustrations of the speed-up. The RESTART technique has already been applied in another Petri net package, TimeNet, but here we implement both RESTART and splitting, and we apply them to a more general class of Petri nets including the fluid ones.

Keywords: Fluid Stochastic Petri Nets, Importance splitting techniques, Rare events simulation, Stochastic Petri Nets.



1 Introduction 

Petri nets (see for instance [2,18,19]) are a powerful paradigm to specify, model and evaluate complex systems in manufacturing, computers, telecommunications, engineering, and so forth. In short, they are defined by places, transitions, links between places and transitions, and by tokens moving between places (or created or removed) when transitions fire. The firing time of a transition can be deterministic or random with a given distribution.

Due to the fluid or hybrid nature of some systems, and also to circumvent the state space explosion of certain discrete Petri nets by considering the tokens in certain places as fluid, Fluid Stochastic Petri Nets (FSPNs) have been introduced mg as an extension of Stochastic Petri Nets (SPNs).
To evaluate these models, many analytical, numerical or simulation methods are available, depending on the properties of the system. A software package called SPNP (for Stochastic Petri Net Package) has been developed [1,5], in which the Petri net is described in CSPL, an ANSI C language library. The user then needs to specify the model and the solution method to be used, with its possible parameters. In this paper, we focus on simulation methods and we add a new technique to the software package.

As a matter of fact, simulation is often the only available method to compute performance, availability or performability measures of a computer or telecommunication system when the state space is large and/or many non-exponential distributions are involved. However, a standard simulation may be inefficient to evaluate rare events; for instance, it takes on average $10^{n}$ independent replications to observe just once an event of probability $10^{-n}$, so for large n it is nearly impossible to obtain a reliable statistical estimation. A common approach to speed up the simulation is to use importance sampling techniques, but to do so, we need to have a deep knowledge of the studied system and, even in such a case, importance sampling may not provide any speed-up PE].
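The cost of standard simulation for rare events can be quantified with elementary formulas. The sketch below assumes independent replications and the crude Monte Carlo estimator (fraction of replications in which the event occurs); the function names and the target of 10% relative error are choices made for this illustration.

```python
import math


def expected_replications_until_first_hit(p: float) -> float:
    """Mean number of independent replications needed to see the event once."""
    return 1.0 / p


def replications_for_relative_error(p: float, rel_err: float) -> int:
    """Replications n such that the standard error of the crude estimator,
    sqrt(p * (1 - p) / n), is at most rel_err * p."""
    return math.ceil((1.0 - p) / (p * rel_err ** 2))


p = 1e-9
first_hit = expected_replications_until_first_hit(p)
needed = replications_for_relative_error(p, 0.1)
```

For an event of probability 1e-9 this gives about 1e9 replications just to observe the event once and about 1e11 replications for a 10% relative error, which motivates the variance reduction techniques discussed next.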

We present here another group of methods to improve the standard simulation: importance splitting techniques. These techniques have been employed since the fifties on physics problems, but have only recently been used on telecommunication and reliability problems. There are two "schools" for telecommunication applications of these methods: chronologically, the first one has been developed in [22,23,24,25,26], where the technique is called RESTART, and the second one is represented by iTTOU], where the technique is called splitting. Other works can also be found for instance in [14,17,20].

In this paper, we apply both splitting and RESTART to SPNs and FSPNs by implementing them in SPNP. RESTART has already been implemented in a Petri net software, TimeNet P, but the implementation handles only SPNs.

The paper is organized as follows. First, we describe stochastic Petri nets and fluid stochastic Petri nets in Section 2. We briefly present the global idea of importance splitting in Section 3.1, then its specific use in splitting and RESTART in the subsections that follow. We always keep in mind that our goal is to implement these techniques in SPNP, so that we are able to show the problems that might occur in this specific application as well as in its implementation. It also needs to be clear that, using this technique, we concentrate on the estimation of a single measure (or very few measures) of the system in a simulation process. Numerical illustrations are then given, and we conclude in the final section.

2 SPNP: Stochastic Petri Net Package 

Petri nets provide a powerful paradigm in their ability to model complex systems (see for instance [18,19] for a description). We deal here with Stochastic Petri Nets (SPN) including transitions with general firing distributions as well as with their extension, Fluid Stochastic Petri Nets (FSPN), where fluid represents many tokens in some places to circumvent the state space explosion (see [4] for a simulation analysis of FSPN and [T^ for an analytical one). We only briefly describe the models here, so the reader is advised to see the above references to understand the precise formalism.

A general SPN is given by a ten-tuple

$$(\mathcal{P}, \mathcal{T}, a, g, >, d, r, F, w, m^0)$$

where $\mathcal{P}$ is the set of places, $\mathcal{T}$ is the set of immediate and timed transitions, a describes the marking-dependent cardinality of the input and output arcs connecting transitions and places, g is the guard of each transition depending on the current state, >, d, r, and F describe the transition priority, distribution, resampling and affected relation when another transition fires, w is the weight function for the transitions to define randomly which one will fire when there is a conflict, and $m^0$ is the initial marking.

An FSPN is an extension of a general SPN where some places can be fluid and 
some fluid can flow through transitions following given rules. Some new functions 
need to be introduced to define precisely an FSPN, in addition to those for SPNs: 
$f$, defining the marking-dependent fluid rate of the arcs connecting transitions
and fluid places where fluid flows continuously; $b$, specifying the bound on each
continuous place; and the initial marking in fluid places. Note also that the
function $a$ defined previously now also defines the marking-dependent fluid impulse of
the input and output arcs connecting transitions and fluid places. The dynamics
of an FSPN is then governed by discrete events (firing of transitions modifying 
the discrete marking) and continuous flows in fluid places between these firings. 

Both SPNs and FSPNs are implemented in Stochastic Petri Net Package 
(SPNP) mm . where users describe their SPN or FSPN in CSPL, a library of 
the ANSI C language. In this package you can use either analytic numeric or 
simulation methods. The implementation of analytic numeric methods is detailed 
in [1] , and the one of simulation methods is thoroughly described in [Sj . 

3 Importance Splitting Techniques 

3.1 Basic Idea of Importance Splitting 

Principle Suppose that we wish to estimate by simulation the probability of 
a rare event A. As a standard simulation is inefficient, we try to improve it by 
defining thresholds where we can split the standard simulation path. Consider 
$k+1$ sets $B_i$ such that $A = B_{k+1} \subset \cdots \subset B_1$ and use the formula

$$P(A) = P(A \mid B_k)\, P(B_k \mid B_{k-1}) \cdots P(B_2 \mid B_1)\, P(B_1) \qquad (1)$$

where each conditioning event on the right-hand side of equation (1) is “not
rare”.

For instance, $A$ may represent, in a network model, the event “loss of a
customer in a buffer of finite capacity $K$”, and $B_i$ may represent the event “the
number of customers in the buffer is greater than or equal to a given value $b_i$” (of
course, we have $b_1 < b_2 < \cdots < K$ to ensure that $A = B_{k+1} \subset \cdots \subset B_1$). In
reliability/availability, $A$ may be the event “reaching the set of failed states” in a
specified model and the $B_i$ sets of states containing $A$ and such that $B_{i+1} \subset B_i$
for all $i$, so that $B_i$ is more likely to occur for small $i$.

The idea of importance splitting is to make a Bernoulli trial to see if the (not
rare) set $B_1$ is hit. If so, we split this trial into $R_1$ trials and
check (again by a Bernoulli simulation) for each new trial whether $B_2$ is hit. We repeat this
splitting procedure at each level if a threshold is hit, i.e., we make $R_i$ retrials
each time $B_i$ is hit by a previous trial. If a threshold is not hit, neither is $A$, so
we stop the current retrial. With this procedure we have effectively considered $R_1 \cdots R_k$
(dependent) trials, counting for example that if we have failed to reach $B_i$ at
the $i$-th step, the $R_i \cdots R_k$ possible retrials have failed. This constitutes a saving
in computational time. Using $R_0$ independent replications of this procedure, an
unbiased estimator of $P(A)$ is



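$$\hat{P} = \frac{1}{R_0 R_1 \cdots R_k} \sum_{i_0=1}^{R_0} \sum_{i_1=1}^{R_1} \cdots \sum_{i_k=1}^{R_k} 1_{i_0}\, 1_{i_0 i_1} \cdots 1_{i_0 i_1 \cdots i_k} \qquad (2)$$

where $1_{i_0 i_1 \cdots i_j}$ is the result of the Bernoulli retrial at stage $j$ (its value is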
1 if it is successful and 0 otherwise). An estimator of the variance can also be 
easily derived, as usually done in Monte Carlo methods when using independent 
replications. 

If the thresholds are well set, this estimator may be efficient. It can be proven
[25] that the optimal simulation is obtained if the number of thresholds is
$k = -\frac{1}{2}\ln P(A) - 1$ and the thresholds are such that $P(B_i \mid B_{i-1}) = e^{-2}$ and
$R_i \approx 1/P(B_i \mid B_{i-1}) = e^{2}$.
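
As a rough numerical illustration of this rule, the following sketch computes the number of thresholds, the per-level conditional probability and the number of retrials from a guessed order of magnitude of $P(A)$. The function name and the rounding of $k$ to the nearest integer are our own assumptions for illustration; they are not part of SPNP.

```python
import math

def optimal_splitting_plan(p_rare):
    """Return (k, p_level, retrials) for a target rare-event probability.

    k is -ln(p_rare)/2 - 1 rounded to the nearest non-negative integer
    (the rounding is an assumption made for this sketch). p_level is the
    common conditional probability such that p_level**(k+1) == p_rare,
    and retrials is its inverse.
    """
    k = max(0, round(-0.5 * math.log(p_rare) - 1.0))
    p_level = p_rare ** (1.0 / (k + 1))
    return k, p_level, 1.0 / p_level

k, p_level, retrials = optimal_splitting_plan(1e-12)
print(k, p_level, retrials)
```

Because $k$ must be an integer, the per-level probability only approximates $e^{-2} \approx 0.135$; the product over the $k+1$ levels is still equal to the target probability.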



Complications in the Case of Rare Events in SPNs and FSPNs. The 

case of the estimation of a measure of a rare event in a Stochastic Petri Net (and 
for other related formalisms) is more complex: as the steady state distribution 
is (generally) unknown, we can hardly check directly, from a simple random 
variable, if a subset of markings is hit. In this case, we use paths traversed 
through the reachability graph of the SPN starting from an initial marking, so 
we are not in steady state. 

Another important point is that here the optimal $B_i$ and $R_i$ for each level $i$
may be difficult to find theoretically, as the simulation computational effort may
vary with the current state and, even if the optimal values of $P(B_{i+1} \mid B_i)$ and $R_i$
are known, the probability of reaching $B_i$ generally varies with the state of entrance
into level $B_{i-1}$ (as several may be possible). Therefore, the most challenging work
on importance splitting techniques is to find good thresholds mm- If they 
are not well determined, the simulation might be inefficient. Unfortunately, to 
determine these thresholds is a really difficult task: suppose that, in an SPN or
an FSPN, we wish to compute the probability that the number of tokens (or the
level of fluid) in a place $p$ is larger than or equal to $x$, i.e., $P(\#p \ge x)$. A natural
choice is to take $B_i$ as the set of states such that $\#p \ge b_i$, where the threshold
$b_i$ needs to be determined, but this kind of threshold might not be optimal. As
a matter of fact, if we consider for instance $p$ as the second queue of the two
queues in a tandem network example [8,9], whatever the $b_i$ are (at the second
queue), the optimal efficiency cannot be obtained if there is a bottleneck at the
first queue. Efficient thresholds can nevertheless be found for this example, but
they depend on both queues and are very specific. This example illustrates that
for a general SPN it is really difficult to determine good thresholds (how to do so is
still unknown in general), so we will consider here the natural ones of the form given
above.



3.2 Splitting 



This method is the one discussed in umm, where it has been thoroughly 
mathematically studied for Markovian models. Optimal efficiency can be ob- 
tained under some conditions. 

Let 0 be the initial state of the system. The measure estimated in [7,8,9,10]
using splitting is the probability $\gamma$ of reaching a rare set $A$ before returning to 0.
This measure may be useful to estimate for instance the expectation of the time
to obtain a large number $x$ of tokens or fluid in a place $p$, i.e., $E(\tau_{p,x})$, by using
the formula $E(\tau_{p,x}) = E(\min(\tau_0, \tau_{p,x}))/\gamma$, where $\gamma = P(\tau_{p,x} < \tau_0)$ and $\tau_0$ is the
time before returning to state 0. Here the numerator can be efficiently estimated
by a crude simulation, so, using importance splitting techniques to estimate $\gamma$,
we can obtain an estimator of $E(\tau_{p,x})$ (the variance can be estimated as well,
see for instance msm)- 

To apply this technique, we need to make the assumption that the process
returns to the initial state infinitely often. The estimator of $\gamma$ is then the one
described in Equation (2), turning $A$ and $B_i$ ($i = 1, \ldots, k$) into respectively
{reaching $A$ before returning to 0} and {reaching $B_i$ before 0}. Figure 1 describes
how a path from 0 works: if $B_1$ is hit before going back to 0 (as is the case in Figure 1),
we split the path into $R_1$ trials; otherwise, if we are back to 0 first, we stop the
simulation. For each retrial (path) starting from the point of entrance in $B_1$, if
we hit $B_2$ before returning to 0 (case of one path in Figure 1), then we split this
path into $R_2$ trials; otherwise we stop this path (case of one path in Figure 1). We
do the same thing at each level for each new retrial trying to hit the next level.
Finally, a path from level $B_k$ ($k = 3$ in Figure 1) which hits $A$ before returning
to 0 is considered as a success, so we also stop this path (this is the case when
the product of indicator functions in Equation (2) equals 1).
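
The procedure above can be sketched on a toy model for which the exact answer is known. The example below is only an illustration under our own assumptions: the "system" is a simple random walk on the integers (up with probability `p_up`, down otherwise), state 0 plays the role of the initial state, the thresholds are integer levels, and the path is started at state 1, i.e. just after leaving 0. None of the names below belong to SPNP.

```python
import random

def split_trial(state, level, thresholds, retrials, p_up, rng):
    """Simulate one (re)trial from `state`, aiming at thresholds[level].

    Returns the number of descendant paths that reach the last threshold
    (the rare set A) before returning to 0.
    """
    target = thresholds[level]
    while 0 < state < target:
        state += 1 if rng.random() < p_up else -1
    if state == 0:
        return 0          # back to 0 first: this path and its retrials fail
    if level == len(thresholds) - 1:
        return 1          # A reached: product of indicators equals 1
    # Threshold hit: split into retrials[level] retrials from the entrance state.
    return sum(split_trial(state, level + 1, thresholds, retrials, p_up, rng)
               for _ in range(retrials[level]))

def splitting_estimate(thresholds, retrials, p_up, n_runs, seed=0):
    """Fixed-splitting estimator: successes / (R_0 * R_1 * ... * R_k)."""
    rng = random.Random(seed)
    hits = sum(split_trial(1, 0, thresholds, retrials, p_up, rng)
               for _ in range(n_runs))
    denom = n_runs
    for r in retrials:
        denom *= r
    return hits / denom

# Levels b_1, b_2, b_3 and the rare level 12; retrials close to 1/P(B_{i+1}|B_i).
p_up = 0.3
ratio = (1 - p_up) / p_up
exact = (ratio - 1) / (ratio ** 12 - 1)      # gambler's-ruin formula from state 1
estimate = splitting_estimate([3, 6, 9, 12], [14, 13, 13], p_up, n_runs=4000, seed=1)
print(exact, estimate)
```

The recursion simulates the retrials of a split one after another, so only the current chain of "parent" states is alive at any time.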

The splitting estimator we will implement in SPNP differs from
previous work in two ways. First, it considers a maximum simulation time $T$ (specified by
the user) for each path, so that we estimate $\gamma = P(\tau_{p,x} < \min(T, \tau_0))$. Using a
large value of $T$ leads to the original quantity. Second, an implicit
assumption in all the initial importance splitting techniques of ISHD! is that we 
cannot go from level $i - 1$ to level $i + 1$ without entering level $i$. This may be a
serious limitation for our application since a lot of tokens or fluid can
instantaneously arrive in a place. In m, the case where a transition can transport the

Fig. 1. An example of a trial with two splits ($R_i = 2$) at each threshold

system more than one level is considered and treated in the same way. We will
do the same thing here. For each simulation path $i_0$, we consider the increasing
sequence of hit levels 1 < l(zo) < • • • < < k where, for clarity 

sake, j{io, zi, • • • , z^-i) is simplified by j when it has already been defined. The 
continuous time estimator is then modified by 



It is actually the estimator in (2) where some thresholds may be skipped. Again,
the variance is estimated as usually done in Monte Carlo methods using independent
replications.

As it may take a long time to go back from $B_i$ to 0, we can also save
simulation time by stopping the simulation of the retrial split at level $i$ when
it is back $d$ levels down. We then assume that it will not hit $B_{i+1}$. This induces
a bias which is difficult to estimate [7].

3.3 RESTART 

RESTART (for more explanations see [17,22,23,24,25,26]) is based on the same
idea as splitting. It is more general with respect to the measures which can be 
estimated, but it also requires some mathematical approximations. 

RESTART can be used to compute rare transient events, but is often used 
to compute the probability P{A) of every kind of rare event in steady state, 
not only the probability of reaching A before coming back to 0, which is an 
advantage with respect to splitting. Thus we can compute the probability of 
loss in a given place and the probability of having more than a given amount of 
resource (discrete or fluid) in a place. 

As was the case for our splitting estimator, our RESTART estimator in 
® can handle jumps of levels, which was not the case in the estimator of
[22,23,24,25]. As for splitting, we start from a state called 0 (we make $R_0$
independent replications) and split the trials when thresholds are reached. The
initial weight $\omega$ of a path from 0 is 1. To explain the differences with respect to
splitting, consider a trial beginning in level $i$. We split the path when we hit a
level $l > i$ (or an upper one) and make $R_l$ retrials. The weight of each retrial
is then that of its “father” multiplied by $1/R_l$. Otherwise, if we leave level
$i$ towards level $i - 1$ first, we consider the path as finished (this is the case where
we reduce the computation time with $d = 1$ in the previous section). This
stopping procedure is applied to all but the last retrial at level $i$. As a matter of
fact, the last one is allowed to go under level $i$. It then continues until it again
reaches level $i$ (or an upper level), and is split again. To offset the effect of the
prematurely stopped first retrials, the current weight $\omega$ of the last one is
multiplied by $R_i$ if it reaches a lower level before going to an upper one. In effect, we
proceed as if all the retrials were grouped together again into a single one.

Assume that the system is ergodic. As $P(A)$, the steady-state probability of
$A$, is given by

$$P(A) = \lim_{T \to \infty} \frac{1}{T} \int_0^T 1_A(t)\, dt,$$

our estimator of $P(A)$ is, using a sufficiently large $T$ (or using a given $T$ if we
consider the transient case),







( 4 ) 



where $N$ counts the number of finishing paths and where the way to split and
to stop is the one described above, i.e., the weight $\omega_i(t)$ (varying with time) is
given by the previous procedure.

Note that we are then not in the stationary regime since we start from a given
state, so we need to choose $T$ large enough. This induces a bias. To estimate the
variance, we consider $R_0$ independent replications of (4).



4 Implementation in SPNP 

SPNP [1,3,4] is a modeling tool for the solution of SPNs and FSPNs. Recall that
an SPN (or an FSPN) is described in CSPL. A CSPL file must specify some 
functions [1]. Important ones are: options, where the options (like the method
we wish to use as well as its parameters) must be present; net, which defines the
SPN; and ac_final, where the details of the requested model solution and of the
user-defined output are called.

Different options defined in options determine the method the user wishes
to use and its parameters. We use the existing simulation options:

— IOP_SIM_RUNS to determine the number of independent replications

— and FOP_SIM_LENGTH to determine the maximum simulation time T of a path.






The new options we introduce are the following: 

— IOP_SIM_RUNMETHOD: its value is VAL_RESTART if we wish to use RESTART for
estimating $\int_0^T P(\#p(t) \ge x)\, dt / T$ by (4), and VAL_SPLIT if we
wish to use splitting for estimating $P(\tau_{p,x} < \min(T, \tau_0))$ by (3).

— IOP_SPLIT_LEVEL_DOWN is an option specific to splitting which sets the
number $d$ such that a path is stopped if it goes $d$ levels down. The
computational time is then reduced but the estimator is biased [7]. If this option
is not specified, the simulation is not stopped by this procedure.

— IOP_SPLIT_RESTART_FINISH is an option specific to RESTART. The default value
VAL_NO means that we use the estimator such that the first $R_i - 1$ retrials
at each level $i$ are stopped when they go under threshold $i$. VAL_YES means
that each retrial is continued until the maximal simulation time $T$ or until
it reaches an upper level.

— Finally, we need to specify the choice of the thresholds. We consider the
natural choice described in Section 3.1, taking $B_i$ as the set of states such
that $\#p \ge b_i$, even if this kind of threshold might be non-optimal. We give
the choice to the user:

• either the user chooses the thresholds directly, by setting the option
IOP_SPLIT_PRESIM to VAL_NO in options, then assigning the number of
thresholds to IOP_SPLIT_NUMBER and the threshold values in the table
FOP_SPLIT_THRESHOLDS. If the threshold values are not
specified, they are chosen uniformly between the initial value in place $p$
and the value $x$ of the probability $P(\#p \ge x)$ we are looking for.

• or the user runs a presimulation (IOP_SPLIT_PRESIM=VAL_YES), which
works as follows. The thresholds are determined one after another. To
determine the first one, we start from the initial state, and for the next
ones we start from a state stored the first time the level was reached.
We then run a standard discrete event simulation, using a number of
independent runs given by IOP_SPLIT_PRESIM_RUNS, for a time $T$ minus
the estimated mean time to reach the previous levels. The value of the
threshold $b_{i+1}$, once $b_i$ has been determined, is found by dividing the
interval $[b_i, x]$ into subintervals and estimating the probability of reaching
these subintervals. We choose for threshold $b_{i+1}$ the lower bound of the
subinterval whose probability is the closest to $e^{-2}$, and $R_{i+1}$ as the inverse
of the estimated probability.

In [I7| , the choice of levels is improved during the simulation by deter- 
mining the average percentage of time spent in each possible discrete 
value of the studied place. This cannot be applied to FSPNs, as the
range of values is continuous and partitioning it into small subintervals
may considerably increase the simulation time, since the number of
events may then be considerably increased. Indeed, each time the content
of the fluid place enters a new subinterval must be treated as
an event in order to update the statistical variables.
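
The selection step of the presimulation can be sketched as follows. This is a minimal illustration under our own assumptions, not the SPNP code: `reach_prob` is a hypothetical callable returning the (pre)estimated probability of reaching a candidate level from the stored entrance state, the interval is divided into `n_sub` equal subintervals, and the target probability is taken to be $e^{-2}$, the optimal per-level value quoted in Section 3.1.

```python
import math

def choose_next_threshold(b_i, x, n_sub, reach_prob):
    """Pick b_{i+1} in (b_i, x) and the number of retrials R_{i+1}.

    Candidates are the lower bounds of the subintervals of [b_i, x],
    excluding b_i itself. The candidate whose estimated reaching
    probability is closest to exp(-2) is selected.
    """
    target = math.exp(-2.0)
    width = (x - b_i) / n_sub
    candidates = [b_i + j * width for j in range(1, n_sub)]
    probs = {c: reach_prob(c) for c in candidates}
    best = min(candidates, key=lambda c: abs(probs[c] - target))
    if probs[best] <= 0.0:
        raise ValueError("no candidate level was reached during the presimulation")
    retrials = max(1, round(1.0 / probs[best]))
    return best, retrials

# Toy usage: the probability of reaching level c halves with each unit of c.
print(choose_next_threshold(0.0, 10.0, 10, lambda c: 0.5 ** c))
```

In a real presimulation `reach_prob` would be a frequency over a limited number of independent runs, so the selected level and number of retrials are themselves noisy.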

For each method, the importance splitting procedure is called in ac_final
by inserting splitting(name_of_place p, x).

To prevent the storage requirement from growing exponentially in the
supercritical case, as it would by running all trials at level $i$ before running the trials
at level $i + 1$, we need to create a special structure holding the elements of the model
which need to be replicated at each splitting point, and we need to adapt all the existing
procedures to this structure. Each time a threshold is reached, the structure
is stored, and the “child” paths are simulated one after another from this
structure. When all the “children” have been simulated, the stored structure is
removed, so that only the structures of the “parents” of the current path are
stored in memory. The cloning structure needs to contain

1. the splitting level; 

2. the retrial number (necessary for RESTART); 

3. a flag recording, for the last retrial when using RESTART, whether we have been
under the initial level, so that we can resplit when hitting the initial level again.
An internal variable, initialized to 0, is set to 1 if the last retrial goes under
its initial level, and to 2 if we have a resplit at this initial level (to
end the path in EndPath when we come back to this path after all retrials
are over in the recursive procedure);

4. the current weight of the path; 

5. the clock; 

6. the marking; 

7. the lists of currently enabled transitions and their clocks (their firing times are
resampled conditionally in case they are larger than the current global
clock).
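
The seven items above can be pictured with the following sketch. It is written in Python for brevity, whereas SPNP itself is a C package; the class name, field names and the helper `run_children` are hypothetical and only mirror the list, they are not the actual SPNP data structure.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class PathSnapshot:
    level: int                 # 1. splitting level
    retrial: int               # 2. retrial number (needed for RESTART)
    resplit_flag: int          # 3. 0, 1 or 2, as described in item 3
    weight: float              # 4. current weight of the path
    clock: float               # 5. global clock
    marking: dict              # 6. marking (place name -> tokens or fluid level)
    firing_times: dict = field(default_factory=dict)  # 7. enabled transitions and clocks

def run_children(snapshot, n_children, simulate_child):
    """Simulate the children of a split one after another.

    Each child works on its own deep copy, so the stored parent snapshot is
    never modified and can be discarded once all children are done.
    """
    results = []
    for r in range(n_children):
        child = copy.deepcopy(snapshot)
        child.retrial = r
        results.append(simulate_child(child))
    return results
```

With this depth-first organisation the number of snapshots held in memory is bounded by the number of levels rather than by the number of retrials.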

5 Numerical Illustrations 

5.1 Dual-Tank Example

We consider the FSPN example of [4,5] representing a dual tank system and
described in Figure 2. It contains a main tank “One” and an additional tank
“Two” (of respective maximal capacity $b_1$ and $b_2$). Fluid flows with rate $r_{in}$ from
an external source into tank “One”, which sends the liquid to a processing station
with rate $r_{out} > r_{in}$. However, the processing station is subject to breakdowns
and repairs (exponentially distributed with respective rates $\lambda$ and $\mu$). During a
breakdown, the station is not fueled and the liquid from the external source is
immediately redirected to the additional tank “Two”. The external source is shut
down only when tank “Two” is full. When the processing station is repaired, the
external flow is immediately switched to tank “One”, which resumes its work. In
addition, the liquid in tank “Two” is pumped into tank “One” with rate $r_{21}$. If
tank “One” is full, the flow from tank “Two” to tank “One” is slowed (to rate
$r_{out} - r_{in}$).

For our computations, the parameter values are $b_1 = 1.0$, $b_2 = 2.0$, $\lambda = 0.1$,
$\mu = 1.0$, $r_{in} = 0.08$, $r_{out} = 1.0$ and $r_{21} = 0.97$. The initial marking is given by:
the processing station is up, the level in tanks “One” and “Two” is 0.0.

We first estimate the cumulative probability that the plant is down during
the time interval $[0, T]$ with $T = 100$, i.e., the probability that tank “Two” is
full. The 10% relative error confidence interval (with confidence level 95%) using
the RESTART estimator and a presimulation is

[9.56926e-13, 1.16952e-12]

and is obtained in 209 seconds (including the presimulation) on a Sun SparcStation
Ultra 60.

Using the same model, we also estimate the probability of a shutdown
before time $T$ and before coming back to the state where both tanks are empty.
The 10% relative error confidence interval (with confidence level 95%) using the
splitting estimator is

[1.03963e-10, 1.27057e-10]

and is obtained in about 315 seconds (including the presimulation).
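
The "10% relative error" stopping criterion can be checked directly on the two reported intervals. The helper below is our own illustration (the name is hypothetical); it takes the relative error to be the half-width of the interval divided by its midpoint, which is an assumption about how the criterion is defined.

```python
def relative_half_width(lower, upper):
    """Half-width of a confidence interval divided by its midpoint."""
    centre = 0.5 * (lower + upper)
    return 0.5 * (upper - lower) / centre

# Intervals reported for RESTART and for splitting on the dual-tank example.
print(relative_half_width(9.56926e-13, 1.16952e-12))
print(relative_half_width(1.03963e-10, 1.27057e-10))
```

Both ratios are very close to 0.1, consistent with simulations that were stopped as soon as the 10% target was met.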

Using the same example but with a rare event parameter $\varepsilon$ such that $r_{in} = \varepsilon$
and $r_{21} = 1.05 - \varepsilon$, we can compare in Figure 3 the speed-up obtained for
RESTART with respect to a standard discrete event simulation (computing the
accumulated time during which place “Off” is not empty). We use a log-log scale for
the axes to keep the figure readable, as the speed-up grows very fast with $1/\varepsilon$. For large
values of $\varepsilon$, the standard discrete event simulation is faster because the event
is not rare and the overhead for RESTART is the presimulation time,
which leads to no thresholds. RESTART brings an improvement as soon
as we have one threshold. For $\varepsilon = 0.14$ (leading to an estimate of about 5.7e-7)
the observed speed-up is about 2284. The value $\varepsilon = 0.08$ ($1/\varepsilon = 12.5$) is not included
on the graph, as after a few days the discrete event simulation was still not able to
produce a 95% confidence interval with 10% relative error (actually no shutdown
event was reached).

Note that importance splitting techniques are not always suitable to estimate
the probability of a rare event: suppose in our previous example that the failure
of the processing station is rare, i.e., $\lambda \ll 1$. As tank “Two” is filled up only
when the processing station is down, we first need to speed up its failure.
This cannot be handled by importance splitting, but can be by simply applying
importance sampling. Thus, both importance splitting and importance sampling
methods have their own advantages and the choice depends on the properties of
the model.

Fig. 3. Computation times to obtain a 95% confidence interval with 10% relative
error for the standard discrete event and RESTART simulators on the dual tank
example

5.2 Reader and Writer Example 

We also illustrate the speed-up due to importance splitting on an SPN not in- 
volving fluid places, but general distributions. We consider the example of 
representing n processes in an operating system sharing a buffer for reading or 
writing (see Figure 4). For reliability reasons, no more than $k$ ($< n$) processes
are authorized to be in reading mode. In Figure 4 the model is described using
several places representing different states: the “inactive” state (for each
token representing a process), the “ready to read” state where the processes are
waiting for the server to read, the “reading” state, the “ready to write” state
and the “writing” state. Place “at most k” is introduced to express the condition
that no more than $k$ processes can be in reading mode. When a process is in
the “ready to write” state, the processes in the “ready to read” state cannot
be given access to the buffer because the data will be modified (this is represented in
Figure 4 by the inhibitor arc) and the processes in the “reading” state exit at
once (according to the immediate transition “exit buffer”). Finally, the firing times
of transitions “reading request” and “writing request” are assumed to be
exponentially distributed with respective parameters 4.0 and 1.0, those of “server for
reading” and “server for writing” are uniformly distributed between 1.0 and 2.0,
and those of “reading over” and “writing over” are assumed to follow a normal
distribution with respective (mean, variance) pairs (2.0, 0.5) and (3.0, 1.0), truncated
to non-negative values.

Fig. 4. Reader and writer sharing buffer example

Figure 5 shows the speed-up obtained (in computation time) by using the
RESTART technique compared with the standard discrete event simulation
when estimating the probability of having $k = n - 2$ processes in reading mode as
$n$ increases.

Here again, for small values of $n$ corresponding to non-rare events, the standard
discrete event simulation is more efficient because of the computational
overhead of the unnecessary presimulation. As soon as one threshold is required
by the presimulation ($n = 16$), the RESTART estimator becomes more efficient
than the standard one, and the speed-up increases with $n$. Again, this speed-up
can only be illustrated in Figure 5 on events that are not very rare, in order to obtain
results for the standard simulation, but the curves allow us to extrapolate the
kind of speed-up that can be obtained using importance splitting.




Fig. 5. Computation times to obtain a 95% confidence interval with 10% relative 
error for the standard discrete event and RESTART simulators on the reader 
and writer example

6 Conclusion

In this paper we have described the implementation of importance splitting 
techniques in SPNP, a software package for solving SPNs and FSPNs. We have 
also illustrated the speed-up that we can obtain thanks to these methods. 

The most challenging work for future research is to find and to implement an
efficient algorithm to determine good thresholds. As mentioned in Section 3.1,
in determining the probability of having more than a given number of tokens (or
more than a given level of fluid) in a place, defining the thresholds as a given
amount of tokens (or fluid) in the place may be inefficient. But since no general
method is known, we use these kinds of thresholds.

Abstract. In this paper, we present a tool for the simulation of fluid
models of high-speed telecommunication networks. The aim of such a 
simulator is to evaluate measures which could not be obtained through 
standard tools in a reasonable delay or through analytical approaches. 
We follow an event-driven approach in which events are associated only 
with rate changes in fluid flows. We show that, under some loose restric- 
tions on the sources, this suffices to efficiently simulate the evolution 
in time of fairly complex models. Some examples illustrate the utiliza- 
tion of this approach and the gain that can be observed over standard 
simulation tools. 



1 Introduction 

Computer networks continue to evolve, increasing in size and complexity. When 
we have to analyze some aspect of the behavior of an existing network or to design 
a new one, the most widely used tool is simulation, both because of its power 
in representing virtually every possible mechanism and system, and because of 
its flexibility. The main price to pay is in programming and computation costs: 
simulator programs are usually difficult to develop and they may need large 
amounts of resources in time and sometimes also in space. The best alternative 
to simulation is to use analytical techniques. In general, they have the advantages 
of being several orders of magnitude less expensive to apply and, moreover, of 
frequently leading to a deeper insight into the properties of the analyzed system. 
The drawback of analytical methods is that they require conditions on the models 
that are hard to satisfy. A third possibility is to use numerical techniques, which 
usually range between simulation and analytical methods, in terms of cost and 
required assumptions. 

In this paper we are interested in the analysis by simulation of high-speed
communication networks, in order to quantify important aspects of their behavior.
These aspects include performance, dependability properties, quality of service,
etc. Communication networks are examples of very complex systems, where
simulation is the only tool able to analyze in some detail the associated mechanisms.
More specifically, we are interested in the behavior of communication
networks where information is sent in discrete units, with complex scheduling 
mechanisms and interactions between different components of the system, and 
using realistic models of sources. Our references are ATM networks transporting 
cells and IP networks where the information travels in packets. The classical 
approach to simulate such a system is to follow the event-driven approach where 
each message (cell, packet) is represented in the model, together with its evolu- 
tion through the different nodes. If we consider now a high-speed network, we 
easily see the problem that may arise when millions or billions of units must be 
generated and moved through the simulated model. For instance, to validate a 
design in the ATM area, where the engineer must check that the loss probability 
of a flow arriving at a specific switch is of the order of $10^{-9}$, at least hundreds
of billions of cells should be sent to the switch in order to obtain a minimal 
accuracy on the result. 
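
The order of magnitude quoted above can be recovered with a back-of-the-envelope computation. The sketch below is our own illustration and rests on simplifying assumptions that the text does not state: cell losses are treated as independent Bernoulli events with probability `loss_prob`, and "minimal accuracy" is taken to mean a 95% confidence interval (normal quantile 1.96) with 10% relative half-width. The example value `1e-9` is likewise only an assumed target.

```python
def cells_needed(loss_prob, rel_error=0.1, z=1.96):
    """Number of simulated cells for a given relative precision.

    For a binomial proportion p estimated from n independent cells, the
    relative half-width of the confidence interval is
    z * sqrt((1 - p) / (n * p)); solving for n gives the formula below.
    """
    return (z / rel_error) ** 2 * (1.0 - loss_prob) / loss_prob

print(f"{cells_needed(1e-9):.3g} cells")
```

Real loss processes are bursty rather than independent, so the figure should be read as a lower bound on the simulation effort.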

To deal with this problem, a possible approach is to try more sophisticated 
simulation techniques that can lead to the same precision with less computational 
effort (importance sampling, splitting, etc.). The drawback here is similar to that 
of analytical methods, though less restrictive: the applicability conditions can be 
too hard to fit. For instance, some techniques of this kind only concern Markov 
models, whereas others work for single queues of some particular class. 

The approach chosen in this paper is the simulation of continuous state models
(fluid models), where the flow of discrete units traveling through the lines and
stored in the buffers is replaced by fluids going from one container to another.
This can lead to a significant reduction in the computational effort. Indeed, when
a long burst of cells or packets is sent through a line (which happens very often),
instead of handling each individual unit as with a classical simulator, it suffices
here to manage only two events: the beginning of the burst and its end^1. Our
paper describes a tool designed to simulate such a fluid model and to take
advantage of this potential gain in computational effort. It must be observed that the
use of continuous state models as approximations of discrete time ones is not
new: this has been done in queuing theory for many years. Think simply
of the diffusion processes used as approximations of standard queues or of queuing
networks in heavy traffic conditions. Fluid models representing high-speed
communication networks are commonly used nowadays, especially in the ATM
area, but mainly for analytical purposes. It must be underlined that just a few
models can be analyzed this way: mainly single node systems with very simple
scheduling mechanisms and strong assumptions on the behavior of the sources.
To the best of our knowledge, there are almost no results on multiclass models
and/or on models with more than one node.

The structure of the paper is as follows. Section 2 gives a short description
of the dynamics of fluid models. Section 3 introduces the class of fluid models
under consideration. Section 4 discusses some issues related to the simulation of
such a system following an event-driven approach. In Section 5, the current state
of our tool, FluidSim, is presented and Section 6 illustrates its use by means of
two examples. Some conclusions and current research directions close the paper
in Section 7.

^1 In fact, this is strictly true only for the sources; nevertheless it illustrates the key
phenomenon.



2 Dynamics of a Fluid Model 

Let us consider first a single fluid reservoir or “buffer” of capacity $B \le \infty$,
with a constant output rate $c \in (0, \infty)$, and a work-conserving FIFO service
discipline. Let $\Lambda(t) \in [0, \infty)$ be the total rate of fluid being fed into the buffer
at time $t \ge 0$, such that every sample path $\Lambda(t)$ is a (right-continuous) stepwise
function. Then $A(t) = \int_0^t \Lambda(u)\, du$ is continuous and piecewise linear. This
condition is not very restrictive since many fluid traffic models satisfy it. Some
examples are: ON/OFF rate processes, either Markovian [1] or non-Markovian
[?]; Markov-modulated rate processes, for which $\Lambda(t) = \zeta(Z(t))$, where $Z(t)$ is
the state of a Markov chain at time $t$ and $\zeta(\cdot)$ is a given function [7]; renewal rate
processes $\Lambda(t) = \sum_n X_n\, 1(t \in [S_{n-1}, S_n))$, where the $X_i$'s form a sequence of
i.i.d. random variables and the $S_n - S_{n-1}$ form a renewal process independent
of the $X_i$'s [3][7].

Let $Q(t)$ be the volume (“level”) of fluid in the buffer at time $t \ge 0$. The
evolution of $Q(t)$ is described by:

$$Q(t) = Q(0) + \int_0^t (\Lambda(s) - c)\,\mathbf{1}(s \in \mathcal{Q})\,ds, \quad t \ge 0, \tag{1}$$

where, for an infinite-capacity buffer, $\mathcal{Q}$ is given by
$\mathcal{Q} = \{s \ge 0 \mid \Lambda(s) > c \text{ or } Q(s) > 0\}$ [7], and for a finite-capacity buffer,
$\mathcal{Q} = \{s \ge 0 \mid (\Lambda(s) > c \text{ and } Q(s) < B) \text{ or } (\Lambda(s) < c \text{ and } Q(s) > 0)\}$.
For right-continuous stepwise input rate functions, this integral reduces to:

$$Q(T_{n+1}) = \min\left\{B, \left(Q(T_n) + (\Lambda(T_n) - c)(T_{n+1} - T_n)\right)^+\right\} \tag{2}$$

where $T_n$ denotes the n-th transition epoch of $\Lambda(t)$; we take $T_0 = 0$. The resulting
sample paths $Q(t)$ are piecewise linear, with slope $\dot{Q}(t) = (\Lambda(t) - c)\,\mathbf{1}(t \in \mathcal{Q})$.
Slope changes occur either at the time instants where the buffer becomes full or
empty, or at the transition epochs of $\Lambda(t)$. The output rate at $t \ge 0$ is

$$R(t) = c\,\mathbf{1}(Q(t) > 0 \text{ or } \Lambda(t) > c) + \Lambda(t)\,\mathbf{1}(Q(t) = 0 \text{ and } \Lambda(t) \le c). \tag{3}$$

For the class of processes $\Lambda$ considered in this work, we can then deduce that
$R(t)$ is also a right-continuous stepwise function. Buffers are non-blocking, that
is, fluid arriving at a full finite buffer is lost. Cumulative fluid losses in the
interval $[0,t]$ may be computed by the integral $\int_0^t (\Lambda(s) - c)\,\mathbf{1}(s \in \mathcal{O})\,ds$, where
$\mathcal{O}$ is the set of all overflow periods: $\mathcal{O} = \{s \ge 0 \mid \Lambda(s) > c \text{ and } Q(s) = B\}$. For
a given $t$, we denote by $t_0(t)$ the beginning of the next empty period; likewise,
$t_B(t)$ denotes the beginning of the next overflow period.
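
The recursion (2) and the loss integral can be illustrated with a small Python sketch. It is not taken from FluidSim: the function names and the convention that the last entry of `epochs` marks the end of the observation window are assumptions made only for this illustration.

```python
import math


def next_level(q, rate_in, c, dt, capacity=math.inf):
    """Equation (2): buffer level after dt time units of constant input rate."""
    return min(capacity, max(0.0, q + (rate_in - c) * dt))


def lost_volume(q, rate_in, c, dt, capacity=math.inf):
    """Volume lost during dt; losses start once the level reaches the capacity."""
    if rate_in <= c or math.isinf(capacity):
        return 0.0
    time_to_full = (capacity - q) / (rate_in - c)
    return max(0.0, dt - time_to_full) * (rate_in - c)


def simulate(epochs, rates, c, capacity=math.inf, q0=0.0):
    """Levels at each epoch and cumulative loss.

    rates[n] is the total input rate on [epochs[n], epochs[n+1]).
    """
    levels, loss, q = [q0], 0.0, q0
    for n in range(len(epochs) - 1):
        dt = epochs[n + 1] - epochs[n]
        loss += lost_volume(q, rates[n], c, dt, capacity)
        q = next_level(q, rates[n], c, dt, capacity)
        levels.append(q)
    return levels, loss


levels, loss = simulate([0.0, 1.0, 2.0, 4.0], [3.0, 0.0, 2.0], c=1.0, capacity=1.5)
```

Only the transition epochs are visited, which is the property that the event-driven simulation of Section 4 relies on.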
The model described so far can be applied to the more general case where the
buffer is fed by $N$ fluid flows. Let us denote by $\lambda_i(t) \in [0,\infty)$ the rate of the
i-th flow at time $t$. We denote $\boldsymbol{\lambda}(t) = (\lambda_1(t), \ldots, \lambda_N(t))$, and we call this the
input flow vector. The total input rate is $\Lambda(t) = \sum_i \lambda_i(t)$. As before, we shall
be interested only in arrival processes such that, for every sample path, $\lambda_i(t)$
is a (right-continuous) stepwise function. It follows that $\Lambda(t)$ is also a
(right-continuous) stepwise function. In the same way, $r_i(t)$ is the output rate related
to the i-th input fluid at time $t$, $\mathbf{r}(t) = (r_1(t), \ldots, r_N(t))$ is the output flow
vector, and $R(t) = \sum_i r_i(t)$ is the total output rate.

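Let us denote by $\tau_n$ the n-th transition epoch of the input flow vector $\boldsymbol{\lambda}(t)$.
Because of the FIFO service discipline, a change in $\Lambda(t)$ occurring at $t = \tau_n$ will
need $Q(\tau_n)/c \ge 0$ time units to propagate to the buffer output (the time needed
to flow out the $Q(\tau_n) \ge 0$ volume units already in the buffer). Then, at time
$\omega_n = \tau_n + Q(\tau_n)/c$, the proportion of output components must be the same as
the proportion of input components; that is, if $\Lambda(\tau_n) > 0$, then for every flow $i$,
$r_i(\omega_n)/R(\omega_n) = \lambda_i(\tau_n)/\Lambda(\tau_n)$. Note also that a change in $\boldsymbol{\lambda}(t)$ may produce two
transitions in $\mathbf{r}(t)$ if $Q(\tau_n) > 0$, $\Lambda(\tau_n) < c$ and the buffer becomes empty before
the next transition in $\boldsymbol{\lambda}(t)$. The evolution of $\mathbf{r}(t)$ is fully described by the values:

$$\mathbf{r}(\omega_n) = \begin{cases} \dfrac{R(\omega_n)}{\Lambda(\tau_n)}\,\boldsymbol{\lambda}(\tau_n) & \text{if } \Lambda(\tau_n) > 0, \\[2mm] \mathbf{0} & \text{if } \Lambda(\tau_n) = 0 \text{ and } Q(\tau_n) = 0 \text{ (in this case, } \omega_n = \tau_n\text{)}, \end{cases} \tag{4}$$

and

$$\mathbf{r}(t_0(\tau_n)) = \boldsymbol{\lambda}(\tau_n), \quad \text{if } t_0(\tau_n) < \tau_{n+1} \text{ and } Q(\tau_n) > 0. \tag{5}$$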
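
As an illustration of the FIFO propagation rule, the following Python sketch computes the delay $Q(\tau_n)/c$ and the output flow vector that becomes visible at $\omega_n$. It uses only the proportionality property stated above together with the total output rate of (3); the function name and argument layout are hypothetical.

```python
def output_flow(lambdas, q, c):
    """Return (delay, r) for an input flow vector change seen at level q.

    delay is Q/c, the time the change needs to reach the buffer output;
    r keeps the input proportions, scaled to the total output rate.
    """
    total_in = sum(lambdas)
    delay = q / c
    if total_in == 0.0:
        return delay, [0.0] * len(lambdas)
    # Total output rate as in (3): c when backlogged or overloaded,
    # otherwise the input simply passes through.
    total_out = c if (q > 0.0 or total_in > c) else total_in
    return delay, [total_out * lam / total_in for lam in lambdas]


delay, r = output_flow([3.0, 1.0], q=2.0, c=2.0)
```

Here the two flows enter at rates 3 and 1, so they share the output capacity 2 in the ratio 3:1 once the backlog of 2 volume units has drained, one time unit later.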

We have now introduced the expressions governing the behavior of the
basic fluid components that make up a communication network: sources and
buffers. The preceding analysis may be applied as well to a network of fluid
buffers. Consider for example the case shown in Fig. 1: if all of the $N + M$
sources are stepwise functions, then the input flow vector of the second buffer
is also composed of stepwise functions; hence, $Q_1(t)$ and $Q_2(t)$ satisfy (2), while
$R_1(t)$ and $R_2(t)$ satisfy (3). Consequently, all the equations presented above can
be applied to more general topologies than single nodes.

A more detailed presentation of the dynamics of a fluid model can be found 
in [Hj and [S|. 

3 Fluid Model of a Communication Network 

Our tool considers fluid communication networks composed of the following
basic elements: fluid sources and sinks, multiplexers (with buffering capacity or
bufferless), communication links and switching matrices. Built on these, we
currently have higher level objects such as different classes of switches, connections,
etc. Let us informally describe here these elements from a modeling point of
view. In the next two sections, we discuss how our tool implements them, and
how more complex objects can be built.

Fig. 1. Two buffers in tandem

As stated before, we only consider sources producing stepwise right-continuous
fluid rate functions. Sinks are destination nodes. Sources and/or sinks
may perform other functions such as complex control mechanisms, by means of
dedicated algorithms. Multiplexers are network nodes composed of one or more
buffers with capacities $\ge 0$. Their functions are to merge the incoming flows
according to some policy, possibly to store fluid and, as for sources and sinks, to
run in some cases algorithms implementing control protocols. A communication
link connects two network components. It is a unidirectional element that
introduces a constant delay $d > 0$ to every flow going through it. A switching matrix
is simply a mapping between two sets of elements. Its function is to separate
the incoming aggregated flow vectors (demultiplexing) and to create new flow
vectors at its output(s) (multiplexing) according to a routing table.

With the previously described elements, we define switches, composed of in- 
put and/or output buffers (which are multiplexing elements) and a switching 
matrix. We also define unidirectional connections, which are formed by a source, 
a sink, a unique fixed route and, possibly, an algorithm associated with some 
control protocol. Only point-to-point permanent connections are taken into ac- 
count, yet our models can be trivially extended in order to represent the process 
of connection establishment and cancellation, as well as the multicast case.
Applications requiring a bi-directional flow (e.g., ATM’s ABR and ABT or Internet’s
TCP connections) can be represented as two unidirectional connections between 
the two communicating end points, which are, for instance, a source and a sink 
with a controller (an algorithm) implementing the corresponding protocol. 

While a fluid model has been adopted for representing the network elements, 
it is also useful to add discrete objects, named fluid molecules, that can be 
emitted by the sources. They have no volume, so the buffer level is not changed 
by the arrival of a molecule. If a molecule arrives at a buffer that is overflowing, it
is lost with probability $(\Lambda(t) - c)/\Lambda(t)$. Observe that this probability corresponds
to the instantaneous fluid loss rate during congestion. Molecules can represent
the behavior of individual entities such as RM (Resource Management) cells 
while simulating ABR and ABT ATM flows, or ACK messages in the case of a 
window-based flow control protocol. 
Example of a fluid network. Figure 2 shows the state of a section of a fluid
network at some instant $t$. It contains many of the components introduced above.
Connection 1, between source $S_1$ and destination $D_1$, is represented in light gray
and traverses the first buffer and the switching matrix. Source $S_2$ belongs to a
second connection represented in dark gray. Its flow traverses both buffers and
the switching matrix and continues downstream. Connection 3, whose flow is
represented in black, starts at source $S_3$ and traverses the switching matrix
where it is routed towards the second reservoir and continues downstream.

A change in the flow $(r_1, r_2)$ leaving the first buffer will occur at time
$t + Q_1/c_1$. The new flow will still have two components: the rate of the first
connection will be $> r_1$, while the one belonging to connection 2 will be $< r_2$.
In a similar way, the current flow $(0, r_3)$ leaving the second buffer will change at
time $t + Q_2/c_2$, producing a rate $> 0$ for connection 2 and $< r_3$ for connection 3.
Molecule $m_1$, which belongs to connection 1, will wait for $Q_1/c_1$ units of time
before it leaves the first buffer.




Fig. 2. Example of a fluid network 



4 Simulation of a Fluid Network 

The variables defining the behavior of a fluid network ($\lambda_i$, $Q_j$, etc.) fed by
stepwise sources always follow piecewise linear sample paths. Therefore, in order
to completely describe the evolution of such a network, we only need to know
the state of its variables for a denumerable set of time instants, that is, the
transition epochs for the functions $\lambda_i(t)$, $r_i(t)$ and $Q_j(t)$. This strongly hints
at the use of discrete-event simulation as a simulation technique: every state
transition in every variable will be related to an event handled by the scheduler.
In the following, we describe the specific list of event types considered in the
model, and how they are handled in the simulation process.

4.1 Event Classes 

Discrete events required for the simulation of our fluid network model may be
grouped into the following main classes: (a) events related to the sources, e.g.
“the rate $\lambda_i(t)$ of source i changes”; (b) events related to the buffers, e.g. “the
input flow vector $\boldsymbol{\lambda}_j(t)$ of buffer j changes”, “the slope of the backlog $Q_j(t)$ of
buffer j changes”, “the output flow vector $\mathbf{r}_j(t)$ of buffer j changes”; (c) events
related to the switching matrices, e.g. “the input flow vector $\boldsymbol{\lambda}_j(t)$ of matrix j
changes”; (d) events related to the communication links, e.g. “the input flow
vector $\boldsymbol{\lambda}_j(t)$ of link j changes”.

4.2 Event Handling 

In order to analyze the handling of events during the simulation of a complex
network, we shall concentrate on a single reservoir S (see Fig. 3), fed by $N$ sources,
whose flow rates are denoted by $\lambda_1, \ldots, \lambda_N$, and $K$ upstream buffers, whose
output flow vectors are denoted by $\mathbf{r}^{u}_1, \ldots, \mathbf{r}^{u}_K$. These buffers are globally fed by
$M + J \ge K$ individual fluid flows, coming from other sources and/or buffers;
from these, only $M$ flows (numbered $N+1, \ldots, N+M$) are fed into S. The input
flow vector of S is then $\boldsymbol{\lambda}(t) = (\lambda_1(t), \ldots, \lambda_N(t), \lambda_{N+1}(t), \ldots, \lambda_{N+M}(t))$, and its
output flow vector is $\mathbf{r}(t) = (r_1(t), \ldots, r_N(t), r_{N+1}(t), \ldots, r_{N+M}(t))$. The total
input and output rates are respectively denoted by $\Lambda(t)$ and $R(t)$.



Fig. 3. Example of a buffer in a network



4.3 Events Related to the Sources 

Rate changes. For the i-th source, a discontinuity in the sample path $\lambda_i(t)$
happening at $t = \tau_n$ implies a transition in $\Lambda(t)$, so to guarantee the synchronization
of arrivals to S we should: (1) handle all simultaneous events related to network
components located upstream with respect to S, i.e., execute all tasks “change
the rate $\lambda_j(t)$ at $t = \tau_n$” and “change the vector $\mathbf{r}^{u}_k(t)$ at $t = \tau_n$”; (2) lastly,
schedule the event “$\Lambda(t)$ changes at $t = \tau_n$”.

Molecule generation. An event “the source i emits a molecule at $t = t'$” produces
the following actions: (1) create an object “molecule belonging to flow i” and
initialize its particular fields if necessary (e.g. birth date for delay measurement
molecules, see [8]); (2) schedule the task “a molecule from flow i arrives at
$t = t'$” for the downstream neighbor.

4.4 Events Related to the Buffers 

Input rate transitions. An event “$\Lambda(t)$ changes at $t = \tau_n$” should trigger the
following actions:

1. Compute $Q(\tau_n)$ and $\omega_n$.

2. Calculate $t_0(\tau_n)$, $t_B(\tau_n)$; then (a) schedule the task “S begins to overflow at
$t = t_B(\tau_n)$”, if this has not been done yet and $t_B(\tau_n) < \infty$, and (b) schedule
the task “S becomes empty at $t = t_0(\tau_n)$”, if this has not been done yet and
$t_0(\tau_n) < \infty$.

3. If $\omega_n \ne t_0(\tau_n)$, schedule the task “change $\mathbf{r}(t)$ at $t = \omega_n$”.

However, computing $t_0(\tau_n)$ and $t_B(\tau_n)$ (and consequently deciding whether
or not to schedule an event “S becomes empty” or “S overflows”) poses some
practical problems, because the evolution of $\Lambda(t)$, and hence of every component
of $\boldsymbol{\lambda}(t)$, from $t = \tau_n$ onwards must be known a priori; see the example in Fig. 4.



Fig. 4. Computation of $t_0(\tau_n)$



Nevertheless, there is a simpler alternative based on the following definitions. 

Definition 1 An input rate transition at time $t = \tau_n$ is an overloading
transition if $\Lambda(\tau_n) > c$. The predicted overflow period will begin at
$\hat{t}_B(\tau_n) = [B - Q(\tau_n)]/[\Lambda(\tau_n) - c] + \tau_n$. If $\Lambda(\tau_n) \le c$, we define $\hat{t}_B(\tau_n) = \infty$.






J. Incera et al. 



Definition 2 An input rate transition at time $t = \tau_n$ is an underloading
transition if $\Lambda(\tau_n) < c$. The predicted empty period will begin at
$\hat{t}_0(\tau_n) = Q(\tau_n)/[c - \Lambda(\tau_n)] + \tau_n$. If $\Lambda(\tau_n) \ge c$ or $Q(\tau_n) = 0$, we define $\hat{t}_0(\tau_n) = \infty$.
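
The two definitions translate directly into code. The Python sketch below is only an illustration with hypothetical function names; it returns `math.inf` where the definitions assign an infinite predicted time.

```python
import math


def predicted_overflow(tau, q, rate_in, c, capacity):
    """Definition 1: predicted start of the overflow period, or infinity."""
    if rate_in > c:
        # An infinite capacity yields an infinite prediction automatically.
        return tau + (capacity - q) / (rate_in - c)
    return math.inf


def predicted_empty(tau, q, rate_in, c):
    """Definition 2: predicted start of the empty period, or infinity."""
    if rate_in < c and q > 0.0:
        return tau + q / (c - rate_in)
    return math.inf


t_full = predicted_overflow(tau=2.0, q=0.5, rate_in=3.0, c=1.0, capacity=1.5)
t_empty = predicted_empty(tau=2.0, q=0.5, rate_in=0.5, c=1.0)
```

Both predictions depend only on the state at the current transition epoch, so they can be computed without knowing the future evolution of the input rate.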

Moreover, note that neither $R(t)$ nor $\mathbf{r}(t)$ changes because of an overflow; so,
it is not necessary, in principle, to schedule an event “S overflows”, because
the lost volume can be computed a posteriori. Therefore, steps 2 and 3 above may
be replaced by the following three steps:

2. If $\hat{t}_B(\tau_{n-1}) < \tau_n$ then S is overflowing, so take fluid loss measurements (for
details, see [9]).

3. Calculate $\hat{t}_0(\tau_n)$ and $\hat{t}_B(\tau_n)$, then: (a) if there’s already an event “S becomes
empty at $t = t'$” in the event list (with $t' \ne \hat{t}_0(\tau_n)$), cancel it, and (b) if
$\hat{t}_0(\tau_n) < \infty$, schedule the event “S becomes empty at $t = \hat{t}_0(\tau_n)$”.

4. If $\omega_n \ne \hat{t}_0(\tau_n)$, schedule the task “change $\mathbf{r}(t)$ at $t = \omega_n$”.

Figure 5 shows a typical sample path for $Q(t)$ and the associated $\Lambda(t)$, as
well as predicted overflow and “underflow” times; the curves correspond to the
parameter values $c = 1$, $B = 1$. Note that, in this case, $\hat{t}_B(0) = t_B(0)$ and
$\hat{t}_0(4.5) = t_0(4.5)$ but $\hat{t}_B(4) \ne t_B(4)$ and $\hat{t}_0(2) \ne t_0(2)$, i.e., there is an event that
gets scheduled but canceled later.
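
The prediction-and-cancellation scheme can be sketched in a few lines of Python. This is a simplified illustration, not FluidSim's code: the class name, its attributes and the use of a single `empty_at` field in place of a real event list are all assumptions.

```python
import math


class FluidBuffer:
    """Single buffer that predicts its next empty time at each input transition."""

    def __init__(self, c, capacity=math.inf):
        self.c, self.capacity = c, capacity
        self.q = 0.0              # level at the last transition epoch
        self.rate = 0.0           # total input rate since the last epoch
        self.last = 0.0           # last transition epoch
        self.lost = 0.0           # cumulative lost volume
        self.empty_at = math.inf  # pending "becomes empty" event, if any
        self.cancelled = 0        # number of predictions that were cancelled

    def _advance(self, now):
        raw = self.q + (self.rate - self.c) * (now - self.last)
        if raw > self.capacity:   # overflow occurred: measure loss a posteriori
            self.lost += raw - self.capacity
        self.q = min(self.capacity, max(0.0, raw))
        self.last = now

    def on_input_rate_change(self, now, new_rate):
        self._advance(now)
        self.rate = new_rate
        old = self.empty_at
        if new_rate < self.c and self.q > 0.0:
            new = now + self.q / (self.c - new_rate)
        else:
            new = math.inf
        if old != new and math.isfinite(old) and old > now:
            self.cancelled += 1   # stale prediction removed from the event list
        self.empty_at = new
        omega = now + self.q / self.c  # epoch at which the output flow changes
        return omega, self.empty_at


buf = FluidBuffer(c=1.0, capacity=1.0)
first = buf.on_input_rate_change(0.0, 3.0)
second = buf.on_input_rate_change(1.0, 0.5)
third = buf.on_input_rate_change(2.0, 2.0)
```

In this run the empty time predicted at epoch 1 is invalidated by the rate increase at epoch 2, which mirrors the "scheduled but canceled later" case discussed above.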





Fig. 5. Estimation of $\hat{t}_0(\tau_n)$ and $\hat{t}_B(\tau_n)$

Buffer underflow. An event “S becomes empty at $t = \hat{t}_0(\tau_n)$” should trigger the
following actions: (1) set $Q(t) = 0$, and (2) execute the event “change $\mathbf{r}(t)$ at
$t = \hat{t}_0(\tau_n)$”, that is, “now”.

Output rate transitions. An event “$\mathbf{r}(t)$ changes at $t = \omega_n$” should trigger the
following actions: (1) calculate $\mathbf{r}(t)$; (2) schedule events “change $\boldsymbol{\lambda}(t)$ at $t = \omega_n$”
for the downstream neighbor fed by S.

The previous step should be synchronized with other similar actions, to guar- 
antee the generation of a single “input rate transition” event for every down- 
stream buffer. 

Molecule arrivals. An event “a molecule arrives at $t = a_n$”, with $a_n \in [\tau_m, \tau_{m+1})$,
should spawn the following actions (for more details, see [8]): (1) if $\hat{t}_B(\tau_m) < a_n$
then S is overflowing, so the molecule will be destroyed with probability
$(\Lambda(\tau_m) - c)/\Lambda(\tau_m)$; (2) if the molecule is not destroyed, then: (a) calculate the waiting
time $W(a_n)$; (b) schedule the task “a molecule arrives at $t = a_n + W(a_n)$” for
the downstream network element that should receive this molecule.
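
A possible shape for the molecule-arrival handler is sketched below in Python. It assumes that the waiting time of a molecule is the time needed to drain the backlog found on arrival, $W(a_n) = Q(a_n)/c$, in line with the example of Section 3; the function name and the injectable `rng` argument are hypothetical.

```python
import random


def molecule_arrival(arrival, q_at_arrival, c, rate_in, overflowing, rng=random):
    """Return the departure time of a molecule, or None if it is destroyed.

    While the buffer overflows, the molecule is dropped with probability
    (rate_in - c) / rate_in; otherwise it waits for the backlog to drain.
    """
    if overflowing and rate_in > c:
        if rng.random() < (rate_in - c) / rate_in:
            return None
    return arrival + q_at_arrival / c


departure = molecule_arrival(2.0, 1.0, 2.0, 1.0, overflowing=False)
```

Passing the random source as an argument makes the handler reproducible when each buffer is given its own random stream.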

Of course, the arrival of special molecules (such as those representing RM 
cells to buffers that implement ABT/ABR protocols) will activate the procedures 
specific to them. 

4.5 Events Related to the Switching Matrices 

Input rate transitions. Let us consider a switching matrix in a switch with $N_{out}$
output ports and let $\boldsymbol{\lambda}^{a}(t)$ be the aggregate input fluid vector. An event “$\boldsymbol{\lambda}^{a}(t)$
changes at $t = \tau_n$” should trigger, for each output port $m = 1, \ldots, N_{out}$, the
following actions: (1) estimate the new input flow vector $\boldsymbol{\lambda}_m(\tau_n)$ for the port
according to the matrix’s routing table; (2) schedule the task “$\boldsymbol{\lambda}_m(t)$ changes
at $t = \tau_n$” for port m.

Molecule arrival. When an event “molecule arrives at $t = a_n$” occurs, the output
port the molecule should be forwarded to is obtained from the matrix’s routing
table and a similar event “molecule arrives at $t = a_n$” is scheduled for that
port.
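
The demultiplexing step of a switching matrix amounts to a table lookup per flow. The following Python sketch shows one way to express it; representing flow vectors as dictionaries keyed by a flow identifier is an assumption made for the example.

```python
def demultiplex(flow_vector, routing_table, n_out):
    """Split an aggregate flow vector into one flow vector per output port.

    flow_vector maps flow id -> rate; routing_table maps flow id -> port index.
    """
    ports = [dict() for _ in range(n_out)]
    for flow_id, rate in flow_vector.items():
        ports[routing_table[flow_id]][flow_id] = rate
    return ports


aggregate = {"conn1": 2.0, "conn2": 0.5, "conn3": 1.0}
table = {"conn1": 0, "conn2": 1, "conn3": 1}
per_port = demultiplex(aggregate, table, n_out=2)
```

Each resulting per-port vector would then be delivered as an "input flow vector changes" event to the element attached to that port.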



4.6 Events Related to the Communication Links 

Given that the link’s model is simply that of a delay element, when any event 
arrives to a link at t = t' , a similar event is programmed for the downstream 
neighbor at t = t' + d, where d > 0 is the link delay. 

5 Implementation of the Fluid Simulator 

In this section we briefly describe the prototype of our fluid simulation tool
FluidSim. It consists of (a) a modular library of network objects that follow
the principles described so far, (b) a discrete-event simulation kernel and
support libraries, and (c) an optional set of routines that permit the definition and
parameterization of complex fluid networks from a configuration file and/or a
graphical interface. The latter routines also provide the environment for
executing several simulation runs of the same network model in order to estimate
confidence intervals.

The simulator follows the object-oriented paradigm, has been implemented
in C++ and makes extensive use of the recently standardized Standard
Template Library. The graphical interface has been coded using the Java
programming language.



5.1 The Simulator Kernel 

As explained before, the simulator is event-driven: the scheduler works indefi- 
nitely by selecting the next earliest event to execute, invoking the event handler 
of the concerned object and returning to select the next event to execute. An 
event usually creates new events that are scheduled, i.e., inserted in the event list 
according to the simulation time at which they should be executed. This cycle 
finishes when there are no more events to process or when a special event to stop 
the simulation is found. An event has several fields; among the most relevant are 
its type, its scheduled time (i.e. when the event should occur), the recipient of the 
event and a payload, by means of which information (flow vectors, molecules) is 
transferred among the simulated network objects. 

FluidSim’s kernel provides some specific facilities to support our fluid models
of network elements, such as an efficient way of canceling scheduled events
(Section 4.4), the ability to identify and handle “simultaneous” events happening
at the same simulation epoch (Section 4.3), and the concept of a two-priority
event scheme (a low-priority event that occurs at the same simulation time as
a high-priority one will be scheduled after the latter).
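
To make the kernel facilities concrete, here is a minimal Python sketch of an event scheduler with cancellation and two priority levels. It is not FluidSim's kernel (which is written in C++): the class and constant names are hypothetical, and lazy deletion is simply one common way to make cancellation cheap; the text does not say which technique FluidSim uses.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # at equal times, HIGH-priority events run first


class Scheduler:
    def __init__(self):
        self._heap = []
        self._ids = itertools.count()
        self._cancelled = set()
        self.now = 0.0

    def schedule(self, time, handler, payload=None, priority=HIGH):
        event_id = next(self._ids)
        # The unique id breaks ties, so handlers are never compared.
        heapq.heappush(self._heap, (time, priority, event_id, handler, payload))
        return event_id

    def cancel(self, event_id):
        # Lazy deletion: the entry is skipped when it reaches the heap top.
        self._cancelled.add(event_id)

    def run(self, until=float("inf")):
        while self._heap and self._heap[0][0] <= until:
            time, _prio, event_id, handler, payload = heapq.heappop(self._heap)
            if event_id in self._cancelled:
                self._cancelled.discard(event_id)
                continue
            self.now = time
            handler(self, payload)


log = []
sim = Scheduler()
sim.schedule(1.0, lambda s, p: log.append(p), "low", priority=LOW)
sim.schedule(1.0, lambda s, p: log.append(p), "high", priority=HIGH)
stale = sim.schedule(0.5, lambda s, p: log.append(p), "stale")
sim.cancel(stale)
sim.schedule(2.0, lambda s, p: log.append(p), "late")
sim.run(until=1.5)
```

After the run, the log contains the high-priority event followed by the low-priority one; the cancelled event is never executed and the event at time 2 is still pending.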

FluidSim provides a family of classes for generating random variates (expo- 
nential, Pareto, etc.) based on a multiplicative linear congruential random num- 
ber generator. Multiple generators may be instantiated simultaneously, providing 
different streams. It includes facilities for collecting data during the simulation 
and for computing some statistical values (mean, standard deviation, histogram 
distributions, time-weighted averages, etc.). Finally, the kernel supports the col- 
lection of simulation traces in log data files for further processing. The user may 
decide to use different log files for the different variables being monitored, or to 
use a common default file for collecting some or all of them. 
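
The random variate classes can be pictured with the following Python sketch. The text only says that a multiplicative linear congruential generator is used; the constants below are the Park and Miller "minimal standard" ones, chosen here as an assumption, and the class and method names are hypothetical. Variates are produced by inversion.

```python
import math


class Lehmer:
    """Multiplicative LCG x <- a*x mod m (constants are an assumption)."""

    A, M = 16807, 2**31 - 1

    def __init__(self, seed):
        self.state = seed % self.M or 1  # the state must never be zero

    def uniform(self):
        self.state = (self.A * self.state) % self.M
        return self.state / self.M       # strictly between 0 and 1

    def exponential(self, mean):
        return -mean * math.log(self.uniform())

    def pareto(self, shape, scale):
        # Inversion of the Pareto CDF; every variate is >= scale.
        return scale / self.uniform() ** (1.0 / shape)


stream_a, stream_b = Lehmer(9977581), Lehmer(234234)
on_period = stream_a.exponential(0.010)
off_period = stream_b.pareto(1.5, 0.8)
```

Instantiating several generators with different seeds gives independent-looking streams, one per stochastic component, which is the usage pattern described above.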



5.2 Network Objects 

The hierarchy of the principal fluid network components implemented at present
is shown in Fig. 6. All the network objects are derived from the abstract base class
ActiveSimObj , from which they inherit the ability to send and process simulation 
events. Let us describe here some specific aspects of the implementation. 



FluidSim: Simulation of Fluid Models 



[Figure 6: class diagram rooted at ActiveSimObj. Classes shown: FluidBuffer,
FIFOFluidBuf, GPSFluidBuf, GPSNode, FluidSource, Link, FluidSink,
SwitchingMatrix, Connection, TraceDriven, OnOff, ClosedLoop, ABR, ABT, TCP.]

Fig. 6. Main hierarchy of FluidSim classes



Traffic Sources. Our current implementation provides three kinds of fluid sources 
that obey the stepwise property. 

— On/Off sources. They switch between an active state in which they transmit 
at peak rate, and a silent state where no transmission takes place. Sojourn 
intervals at each state are i.i.d. random variables with arbitrary distributions. 

— Trace-based sources. The instantaneous transmission rate is calculated from 
files containing traces of real traffic, for instance, sequences of MPEG-1 
I/B/P frame sizes. 

— Closed-loop sources. At present, we are experimenting with three kinds of 
closed-loop fluid sources: ABR and ABT sources that transmit RM-type 
molecules and adapt their rate according to the molecules they receive, following
different algorithms. A fluid version of TCP’s window based flow control is 
also under study. 
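
A stepwise ON/OFF source is easy to sketch in Python. The function below is an illustration only: its name and its interface (two callables that return sojourn times) are assumptions, not the FluidSim API.

```python
def on_off_trace(peak_rate, sojourn_on, sojourn_off, horizon, start_on=True):
    """Return the transitions [(epoch, rate), ...] of an ON/OFF rate process.

    sojourn_on and sojourn_off are callables that return the next sojourn
    time, so any distribution can be plugged in.
    """
    t, on, trace = 0.0, start_on, []
    while t < horizon:
        trace.append((t, peak_rate if on else 0.0))
        t += sojourn_on() if on else sojourn_off()
        on = not on
    return trace


trace = on_off_trace(2.0, lambda: 0.5, lambda: 0.25, horizon=1.5)
```

With deterministic sojourns of 0.5 (ON) and 0.25 (OFF), the source alternates between rate 2 and rate 0; replacing the lambdas by random variate generators yields the i.i.d. sojourn times described above.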

Multiplexers. Basically, two kinds of multiplexing nodes are currently imple- 
mented: 

— FIFO fluid buffers with capacity $B \le \infty$. The limit case where the buffer
capacity is $B = 0$ allows a bufferless multiplexing node to be simulated.

— GPS nodes where a set of fluid buffers are served simultaneously with differ- 
ent weights or service rates. A variant GPS + Best Effort is also provided. 

We also have support for implementing more complex mechanisms as in the
ABR and ABT ATM service classes, or in TCP connections. For instance, we
have already implemented a fluid version of the ERICA/ERICA+ ABR control
algorithm [10].

Fluid Sinks. Fluid sinks are the end point of a connection where the arriving 
flow is absorbed and some statistics concerning the individual connection are 
collected if required. In addition to this basic fluid sink, the library provides 
objects that complement our closed-loop experimental components: ABT, ABR 
and TCP sinks. 

Links. Communication links implement the elementary delay component
described in Section 3.

Switching Matrix. Non-blocking zero-delay switching matrices are usually found
as components of fluid switching nodes, although they are independent objects.
They implement the demultiplexing and multiplexing functions described in
Section 3.

Connections. As mentioned before, currently unidirectional unicast connections 
are implemented. A connection always starts at a fluid source, follows a path 
composed of different network nodes and ends up at a fluid sink. 

Molecules. Molecules are considered as being part of the flow produced by a con- 
nection’s source (and it is always possible to identify the connection the molecule 
belongs to). Therefore, a source can only emit molecules at those intervals where 
its rate is strictly positive. They are not only used to implement control protocols
like ABR in the ATM area or TCP in the Internet context, but also to
measure end-to-end delays experienced by individual connections (see [8]).

5.3 Configuration Interface 

Using FluidSim as a set of C++ libraries is appropriate while developing new
network components or if one wants to integrate those libraries into existing C++
programs. However, when the goal is to simulate different network configurations
(e.g. what-if scenarios) using the objects already provided, a more flexible
environment is offered: the network topology, object parameterization and general
simulation data may be input via a graphical interface or through an ASCII
configuration file. In this mode, the simulator runs as a stand-alone program.
A simulation driver reads the configuration file, creates the necessary network
objects, starts the simulation and collects global statistics. If required, it launches
several simulation runs with separate random number sequences for estimating
confidence intervals, and resets the network objects’ states between runs.

The graphical interface provides the standard functions associated with this kind of
tool: with the aid of icons, menu bars and mouse movements, the user can
create a network topology, define object parameters, copy and delete individual
objects or groups of objects (sub-networks), etc. Once the topology is defined,
the tool performs some validations and creates the configuration file.

The configuration file follows a simple syntax. Simulation objects are defined in
blocks. A block is composed of a type tag and a body in brackets where the object’s
parameters are defined. For space reasons we do not detail the whole syntax specified
for the configuration files. To illustrate it, let us consider a network composed of two
buffers (BufA, BufB) and two connections. Connection 1 flows from source Src1
to sink Snk1, traversing BufA and BufB. Connection 2 flows from source Src2 to
sink Snk2, passing through BufA only. The corresponding configuration file is
shown below.

    # general parameters
    SIMULATION {
        DURATION 10000.0,
        WARMUP_TIME 45.0,
        SEEDS 9977581 234234,
        OUT_FILE MyRun.out,
        NUM_RUNS 30 }

    # RNG block
    RANDOM_GENERATOR {
        rng1; rng2;
        rng3 SEEDS 1234581 6688774; }

    # Source block.
    SOURCE {
        Src1 ON_OFF
            PEAK_RATE 2048000.0,
            SOJOURN_ON
                DETERMINISTIC 0.4,
            SOJOURN_OFF
                PARETO 1.5 0.8 rng3;
        Src2 TRACE_DRIVEN
            FILE ATraceFile.dat; }

    # Sink block.
    SINK {
        Snk1;
        Snk2 STATS YES; }

    # Buffers’ description block
    BUFFER {
        BufA INFINITE
            RATE 51200.25;
        BufB FINITE
            RATE 888812.2315,
            POLICY FIFO,
            SIZE 512000; }

    # Connection block
    CONNECTION {
        First
            PATH <Src1->BufA->BufB->Snk1>;
        Second
            PATH <Src2->BufA->Snk2>; }



6 Examples 

In this section we shall briefly present some results obtained with FluidSim. 
The first example was used for testing the fluid simulation tool. In particular, it
allows us to check its accuracy by comparison with analytical results. The second
one was specifically designed to illustrate the gain in efficiency with respect to 
standard tools; we focus there only on the event processing rate. 

6.1 Comparison with an Analytical Model 

Let us consider a simple model composed of a single finite-capacity buffer fed 
by ten homogeneous ON/OFF sources, whose ON and OFF periods are expo- 
nentially distributed with mean 10 ms, and with a peak rate of 15 Mbps. The 
buffer has a capacity of 1 Mb and an output rate of 100 Mbps. We are interested 
in the complementary distribution Pr(Q > q) of the fluid level Q. This is an 
interesting topology for validation purposes, because it is simple enough so that 
very long simulation runs can be performed in a reasonable amount of time and, 
most importantly, the exact expression of Pr(Q > q) is known, see e.g. [5]. 

We performed 30 independent simulations. Each simulation run corresponded 
to a simulated interval of $200 \times 10^{3}$ s (≈ 55.5 hours) and took about one hour
of computing time on a Sun UltraSPARC workstation shared between several 
users. The number of events processed in each run was « 2 x 10®. Note that the 
equivalent number of events that should have been processed by a conventional 
simulator, if the discrete units were for instance ATM cells, would be ≈ $3.5 \times 10^{10}$.
Figure 7 shows the fluid level distribution, with 95% confidence intervals
(shown as small crosses); note the excellent agreement between theoretical and
experimental values.

Fig. 7. Example 1: complementary distribution of fluid level Q
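
The cell-level event count quoted for this example can be checked with a short calculation. The Python sketch below assumes that each source is ON half of the time (equal mean ON and OFF periods) and that one event is counted per 53-byte ATM cell; both assumptions are ours, used only to reproduce the order of magnitude.

```python
SOURCES = 10
PEAK_BPS = 15e6       # peak rate of each source, in bit/s
DUTY_CYCLE = 0.5      # mean ON and OFF periods are equal
DURATION_S = 200e3    # simulated interval, in seconds
CELL_BITS = 53 * 8    # one ATM cell is 53 bytes

mean_rate_bps = SOURCES * PEAK_BPS * DUTY_CYCLE  # 75 Mbit/s on average
cells = mean_rate_bps * DURATION_S / CELL_BITS   # about 3.5e10 cells
hours = DURATION_S / 3600                        # about 55.6 hours
```

The mean load of 75 Mbps is below the 100 Mbps output rate, so the buffer is stable, and the roughly $3.5 \times 10^{10}$ cells agree with the figure given in the text.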



6.2 Comparison with a Cell-Level Simulation Tool 

The following example highlights the efficiency of the fluid simulation paradigm
with respect to the traditional discrete approach. The network under study is
shown in Fig. 8; note that this topology is vulnerable to the “ripple effect”
first described by Kesidis et al. [4], which could degrade the performance of a
fluid-model simulation [6].






Fig. 8. Example 2: network topology 

For comparison purposes, we have selected the well-known NIST simulator
which is a cell-level, ATM-oriented tool. We performed 30 one-second simulation 
runs in order to obtain mean values. The parameters used are as follows: 

— 10 homogeneous ON/OFF sources, with peak rate = 18 Mbps and exponentially
distributed ON and OFF periods with mean 10 ms. The mean
burst size is then 22.5 KB ≈ 424 ATM cells. All buffers are identical, with a
capacity of 1 Mb.

— Link $L_1$ has a transmission rate of 100 Mbps; $L_2$ and $L_3$ operate at 50 Mbps.

— Connections of traffic sources $S_1, \ldots$ follow the route $B_1 \to L_1 \to$ switch
$\to B_2 \to L_2 \to B_4$, whereas the remaining connections follow the route
$B_1 \to L_1 \to$ switch $\to B_3 \to L_3 \to B_5$.

Simulations of the fluid model showed that $\Pr(Q_1 > 0) \approx 0.66$, i.e., the buffer
$B_1$ holds fluid about 66% of the time; hence, about 66% of transitions in each
source will affect the output rate of all flows going out of $B_1$, and so the influence
of the ripple effect should be non-negligible [6].

Table 1 presents some results regarding the computational cost of cell-level
and fluid-level simulations. Both the NIST simulator and our fluid simulator were 
compiled using the same development tools and run on the same architecture 
(a PowerPC-based computer); moreover, the NIST simulator was executed in 
non-interactive mode, so as to optimize its performance. 



Table 1. Example 2: results on simulation efficiency

Simulator   Mean execution   Mean number of      Event processing rate
            time (s)         processed events    (events/s)
NIST        18.11            2.502 × 10^6        1.382 × 10^5
FluidSim    0.166            1.374 × 10^4        8.277 × 10^4



We can see that, even though the treatment of fluid-level events is ≈ 1.67
times more expensive than the processing of cell-level events, the fluid-model
paradigm allows for a reduction in the number of events that far outweighs the
increased computational cost; note that this is so in spite of the ripple effect.
The speedup factor is 18.11/0.166 = 109.1.
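
The figures of Table 1 can be cross-checked with the short Python calculation below. The exponents used for the event counts ($10^6$ for NIST, $10^4$ for FluidSim) are assumptions: they are the values consistent with the reported processing rates and with the 1.67 cost ratio.

```python
nist_time, nist_events = 18.11, 2.502e6      # seconds, events (exponent assumed)
fluid_time, fluid_events = 0.166, 1.374e4    # seconds, events (exponent assumed)

nist_rate = nist_events / nist_time           # about 1.382e5 events/s
fluid_rate = fluid_events / fluid_time        # about 8.277e4 events/s
cost_ratio = nist_rate / fluid_rate           # per-event cost, fluid vs cell
event_reduction = nist_events / fluid_events  # fewer events by this factor
speedup = nist_time / fluid_time              # overall gain in execution time
```

The speedup is the event reduction divided by the per-event cost ratio: about 182 times fewer events, each about 1.67 times more expensive, gives the factor of about 109.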

7 Conclusions 

The simulation tool we have introduced in this paper is particularly well suited to 
study high-speed telecommunication networks with arbitrary bursty sources. We 
have shown that it is possible to evaluate measures which would be very difficult 
to obtain through standard simulation tools. We give some results comparing 
FluidSim to a well-known simulator, illustrating the gain that can be obtained. 
Of course, this tool is devoted to obtaining values of measures that we cannot
get by means of analytical approaches, either because we work on the entire 
network or because we look for sophisticated metrics, or just because the model 
is not a simple one. This is also the case when we want to evaluate new protocol 
mechanisms in detail. 

The tool is written in C++, which makes it easy to extend it
to deal with new architectures. We are currently working on some extensions,
namely to deal with complex control algorithms.

References 

1. D. Anick, D. Mitra, and M. M. Sondhi. Stochastic theory of a data-handling system 
with multiple sources. Bell System Technical Journal, 61(8):1871-1894, October 
1982. 

2. N. Golmie, F. Mouveaux, L. Hester, Y. Saintillan, A. Koenig, and D. Su. The NIST
ATM/HFC Network Simulator: Operation and Programming Guide, version 4.0,
December 1998.

3. P. R. Jelenkovic, A. A. Lazar, and N. Semret. Multiple time scales and
subexponentiality in MPEG video streams. In Proceedings of the International
IFIP-IEEE Conference on Broadband Communications, Montreal, Canada, April
1996. ftp://ftp.ctr.columbia.edu/CTR-Research/comet/public/papers/96/JEL96.ps.gz

4. G. Kesidis, A. Singh, D. Cheung, and W.W. Kwok. Feasibility of fluid event-
driven simulation for ATM networks. In Proceedings of IEEE Globecom’96, London,
November 1996. http://cheetah.vlsi.uwaterloo.ca/~kesidis/fluid.ps

5. V. G. Kulkarni. Fluid models for single buffer systems. In J. H. Dshalalow, 
editor, Frontiers in Queueing: Models and Applications in Science and Engineering, 
chapter 11, pages 321-338. CRC Press, 1997. 

6. B. Liu, Y. Guo, J. Kurose, D. Towsley, and W. Gong. Fluid simulation of large
scale networks: issues and tradeoffs. In Proceedings of the International Conference
on Parallel and Distributed Processing Techniques and Applications (PDPTA’99),
volume IV, pages 2136-2142, Las Vegas, 1999.

7. J. Roberts, U. Mocci, and J. Virtamo, editors. Broadband Network Teletraffic: 
Performance Evaluation and Design of Broadband Multiservice Networks — Final 
Report of Action COST 242. Number 1155 in Lecture Notes in Computer Science. 
Springer, 1996. 

8. D. Ros and R. Marie. Estimation of end-to-end delay in high-speed networks 
by means of fluid model simulations. 13th European Simulation Multiconference, 
Warsaw, June 1999. 

9. D. Ros and R. Marie. Loss characterization in high-speed networks through simu- 
lation of fluid models. SPECTS’99 (1999 Symposium on Performance Evaluation 
of Computer and Telecommunication Systems), Chicago, July 1999. 

10. R. Jain, S. Kalyanaraman, R. Goyal, S. Fahmy, and R. Viswanathan. Erica switch
algorithm: A complete description. Contribution 96-1172, ATM Forum, August
1996. ftp://ftp.netlab.ohio-state.edu/pub/jain/atmf/atm96-1172.ps



Exploiting Modal Logic 
to Express Performance Measures 



Graham Clark¹, Stephen Gilmore¹, Jane Hillston¹, and Marina Ribaudo²



¹ LFCS, The University of Edinburgh, Scotland
{gcla, stg, jeh}@dcs.ed.ac.uk
² Dipartimento di Informatica, Università di Torino, Italy
marina@di.unito.it



Abstract. Stochastic process algebras such as PEPA provide ample 
support for the component-based construction of models. Tools compute 
the numerical solution of these models; however, the stochastic process 
algebra methodology has lacked support for the specification and cal- 
culation of complex performance measures. In this paper we present a 
stochastic modal logic which can aid the construction of a reward struc- 
ture over the model. We discuss its relationship to the underlying theory 
of PEPA. We also present a performance specification language which 
supports high level reasoning about PEPA models, and allows queries 
about their equilibrium behaviour. The meaning of the specification lan- 
guage has its foundations in the stochastic modal logic. We describe the 
implementation of the logic within the PEPA Workbench and a case 
study is presented to illustrate the approach. 



1 Introduction 

It has long been recognised that whilst Markovian models of simple computer 
systems can be constructed without explicit notational support, for complex sys- 
tems use of some high-level modelling formalism becomes essential. A variety of 
formalisms exist, for example queueing networks [1], generalised stochastic Petri 
nets (GSPN) [2], stochastic activity networks (SAN) [3] and stochastic process 
algebras (SPA) [4]. Unfortunately, corresponding high-level notational support
has not been developed for querying models and checking that performance
specifications will be met. At best, some support for constructing a reward structure
which captures the desired measure is provided.

In this paper we use Performance Evaluation Process Algebra (PEPA) [4], 
a compact formal language for modelling distributed computer and telecommu- 
nications systems. PEPA models are constructed by the composition of compo- 
nents which perform individual activities or cooperate on shared ones. Using such 
a model, a system designer can determine whether a candidate design meets both 
the behavioural and the temporal requirements demanded of it. Markovian SPA, 
such as PEPA, are enhanced with information about the duration of activities 
and, via a race policy, their relative probabilities. Several such languages have 
appeared in the literature; these include PEPA [4], TIPP [5] and EMPA [6].
Essentially these all propose the same approach to performance modelling: a
corresponding continuous time Markov chain (CTMC) is generated via a struc- 
tured operational semantics; linear algebra can then be used to solve the model 
in terms of equilibrium behaviour. This behaviour is represented as a probability 
distribution over all the possible states of the model. 
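
To make the solution step concrete, the following minimal sketch solves the global balance equations $\pi Q = 0$ together with $\sum_i \pi_i = 1$ for a small, hypothetical three-state CTMC. The generator matrix, the state ordering and the helper name `steady_state` are assumptions made for illustration and are not taken from any PEPA tool; plain Gaussian elimination stands in for whichever linear algebra method a real solver would use.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 for an irreducible generator Q (list of rows)."""
    n = len(Q)
    # Transpose Q so that the unknown distribution is a column vector.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    # The balance equations are linearly dependent: replace the last one
    # by the normalisation condition sum(pi) = 1.
    A[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi


# Hypothetical generator of a three-state cycle; every row sums to zero.
Q = [[-2.0, 2.0, 0.0],
     [0.0, -3.0, 3.0],
     [1.0, 0.0, -1.0]]
pi = steady_state(Q)
```

For this cyclic chain the equilibrium probability of each state is inversely proportional to its exit rate, which gives a simple way to check the result by hand.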

This distribution is seldom the ultimate goal of performance analysis; instead 
the modeller is interested in performance measures which must be derived from 
this distribution via a reward structure defined over the CTMC [7]. A recent 
case study by first-time users of PEPA [8] reported that a significant proportion 
of the effort was spent in deriving the performance measures once steady state 
analysis was complete. 
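
As a small illustration of how a reward structure turns a steady state distribution into a measure, the sketch below computes a utilisation and a throughput as expected rewards. The distribution, the meaning of the three states and the rate value are made-up assumptions chosen only for the example.

```python
def expected_reward(pi, reward):
    """Expected reward at equilibrium: the sum over states of pi[s] * reward[s]."""
    return sum(p * r for p, r in zip(pi, reward))


# Hypothetical three-state model: idle, busy, failed.
pi = [0.5, 0.4, 0.1]

# Utilisation: reward 1 in the busy state, 0 elsewhere.
utilisation = expected_reward(pi, [0.0, 1.0, 0.0])

# Throughput of a "serve" activity that is enabled only in the busy state
# at an assumed rate of 3.0: the reward is the rate of that activity.
throughput = expected_reward(pi, [0.0, 3.0, 0.0])
```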

The study of temporal and modal logics in conjunction with process alge- 
bras is well-established. These logics express properties of systems which have a 
number of states, and in which there is a relation of succession. A modal logic is 
used to express finite behaviour. In a temporal logic one or more operators are 
introduced allowing reasoning to be carried out over infinite behaviour. Over the 
last decade, process algebras have been extended to capture additional informa- 
tion about systems, such as the relative probability of choices and the timing 
of actions. Analogously, extensions have been made to the syntax of the logics 
which allow properties to be expressed which reflect the additional information 
being captured [9,10,11,12]. 

Here we present a stochastic modal logic and explain how it may be used 
to specify performance measures over a PEPA model. The logic has several 
attractive features: 

— it expresses properties in a high-level manner, focusing on the possible
behaviours of the model rather than the states;

— properties expressed in this way remain invariant under model transforma- 
tions such as automatic aggregation; 

— a specification can be constructed in a compositional manner reflecting the 
compositional structure of the model. 

Since we are interested in steady state behaviour it is perhaps surprising that 
we use a modal, and not a temporal, logic. However, as we will explain, we 
have found that a modal logic is sufficient for specifying a reward structure 
over a model assumed to be in equilibrium. This approach to specifying per- 
formance measures over PEPA models has been incorporated into the PEPA 
Workbench [13]. Finally, recognising that the logic expressions may be intimi- 
dating to some users, we have developed a high-level model query language. This 
language has foundations in the stochastic logic. 

Earlier work by Clark proposed the use of a modal logic to define the reward 
structure over a PEPA model [14]. While demonstrating feasibility, this work 
suffered from a major drawback. The logic used did not include any representation
of the timing aspects of PEPA and consequently did not have a clear
relationship to the equivalence relations which have been established for the
language, such as strong equivalence. In the current work we address this problem
by developing a stochastic logic which takes full account of the random variables
used to represent the duration of activities in PEPA. An earlier version of this
work appeared in [15]. Our reward language has now been extended to take ad- 
vantage of the compositional structure of models. Moreover, in this paper, we 
additionally describe the implementation of the approach and illustrate it using 
a more substantial case study. 

In the next section we give a succinct summary of the PEPA language and 
motivate the need for a formal notation for specifying the performance of a 
PEPA model. Since we provide only a brief summary of PEPA here, the reader 
should consult [4] for full details. The PEPA Reward Language and its associated 
stochastic modal logic are presented in Section 3, whilst the implementation is 
described in Section 4. In Section 5 we illustrate our ideas with a simple, yet 
realistic, example. Finally, conclusions and future directions for the work are 
presented at the end of the paper. 

2 PEPA 

PEPA (Performance Evaluation Process Algebra) extends classical process alge- 
bra with the capacity to assign exponentially distributed durations to activities, 
which are described in an abstract model of a system. It is a concise formal lan- 
guage with a small number of grammar rules which define the well-formed terms 
in the language. An activity of action type $\alpha$ performed at rate $r$ preceding $P$
is denoted by $(\alpha, r).P$. Using the symbol $\top$ instead of a rate denotes passive
participation in a shared activity. Choices are separated by $+$. Cooperation
between $P$ and $Q$ over a set $L$ of action types is $P \bowtie_L Q$, or $P \parallel Q$ if $L$ is empty.
Hiding the activities in $L$ and thus denying their availability for cooperation
gives the term $P/L$. The notation for definitional equality is $\stackrel{\mathrm{def}}{=}$. The syntax may
be formally introduced by means of the following grammar:

$$S ::= (\alpha, r).S \mid S + S \mid C_S$$

$$P ::= P \bowtie_L P \mid P/L \mid C$$

where S denotes a sequential component and P denotes a model component which 
executes in parallel. C stands for a constant which denotes either a sequential 
or a model component, as introduced in a definition. $C_S$ stands for constants
which denote sequential components. The effect of this syntactic separation
between these types of constants is to constrain legal PEPA components to be
cooperations between sequential processes. This constraint is necessary for the 
underlying Markov process to be ergodic. 
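
The two-level grammar can be mirrored by a small abstract syntax. The sketch below is one possible encoding, not the representation used by any PEPA tool: the class names, the use of `None` for the passive rate and the helper `is_sequential` are all assumptions made for illustration. The helper checks the syntactic restriction that only sequential terms may appear below a prefix or a choice.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Union


@dataclass(frozen=True)
class Prefix:
    """(action, rate).cont; a rate of None stands for the passive symbol."""
    action: str
    rate: Optional[float]
    cont: "Seq"


@dataclass(frozen=True)
class Choice:
    """left + right"""
    left: "Seq"
    right: "Seq"


@dataclass(frozen=True)
class Const:
    """A named constant bound by a defining equation."""
    name: str


@dataclass(frozen=True)
class Coop:
    """Cooperation of two model components over a set of action types."""
    left: "Model"
    right: "Model"
    actions: FrozenSet[str]


@dataclass(frozen=True)
class Hide:
    """body / actions"""
    body: "Model"
    actions: FrozenSet[str]


Seq = Union[Prefix, Choice, Const]
Model = Union[Coop, Hide, Const]


def is_sequential(term, seq_constants):
    """True if term is built only from prefix, choice and sequential constants."""
    if isinstance(term, Prefix):
        return is_sequential(term.cont, seq_constants)
    if isinstance(term, Choice):
        return (is_sequential(term.left, seq_constants)
                and is_sequential(term.right, seq_constants))
    if isinstance(term, Const):
        return term.name in seq_constants
    return False  # Coop and Hide are model-level constructs


# Hypothetical terms: a legal sequential prefix, a cooperation, and an
# illegal term that places a cooperation underneath a prefix.
on_off = Prefix("on", 1.5, Const("Off"))
pair = Coop(Const("Proc"), Const("Proc"), frozenset())
bad = Prefix("go", 1.0, pair)
```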

Using the structured operational semantic rules of the language it is possible 
to generate, directly from a PEPA model, a continuous time Markov process 
which faithfully encodes the temporal aspects of the PEPA model. The PEPA 
Workbench is used to check the well-formedness of PEPA models and to generate 
their Markov process representation. It detects faults such as deadlocks and 
cooperations which do not involve active participants. It is described in full in 
an earlier paper [13]. The steady state distribution may be found by applying any
one of a number of linear algebra solution methods to the generator matrix. We
have recently extended the Workbench with the capability to reduce models to 
a canonical form internally, thereby automatically aggregating the model [16]. 
This has considerable benefits in terms of tackling the state space explosion 
problem, but means that the states of the Markov process which is solved are 
no longer in one-to-one correspondence with the states of the PEPA model. 

The formal aspects of PEPA have been exploited in developing the mapping 
from the language to the Markov process and in the automatic aggregation 
techniques. However, the extraction of performance measures from the resulting 
steady state probability distribution has been a largely ad hoc procedure. A 
reward structure is used to calculate appropriate expectations over the state 
space but determining which states should have a reward attached has relied on 
the knowledge of the modeller, and such states were characterised as syntactic 
terms [17]. Apart from relying on the modeller’s insight, this technique also has 
the disadvantage of being incompatible with the automatic aggregation. Thus 
we have been motivated to develop a companion reward language for PEPA, 
centred on a stochastic modal logic, which characterises in behavioural terms 
the states to which rewards must be attached. 



3 The PEPA Reward Language 
and Stochastic Modal Logic 



In this section we introduce a stochastic modal logic, which is used at the core of 
our reward language. In particular, expressing equilibrium properties and testing
whether they are satisfied is closely related to specifying and model checking
a formula expressed in probabilistic modal logic (PML [18]).
We give a modified interpretation of such formulae suitable for reasoning about 
PEPA’s continuous time models. 

Previous work by Clark [14] proposed an approach to generating measures 
using traditional Hennessy-Milner logic (HML [19]). The idea was to capture 
the set of ‘interesting’ states of the model by partitioning the state space with 
a formula of the logic — those states that enjoy the property are then assigned a 
reward, such as a number, or a value based on ‘local state’ information, such as 
the rate at which the state may perform a particular activity. All uninteresting 
states are given a reward of 0. In this way, a reward vector is formally specified, 
and equilibrium measures such as utilisation and throughput may be calculated. 
However, the method was not ideal for several reasons. Firstly, it was ad hoc — the 
logic provided an initial partition only, meaning that a calculational technique 
was required in addition, in order to assign reward values. Secondly, the logic was 
qualitative only, in that it disregarded the rate at which a PEPA process could 
perform an activity, and only captured the fact that an activity was possible. 
These inadequacies led us to base our recent work on a more appropriate logic,
namely Larsen and Skou’s PML.

3.1 Probabilistic Modal Logic

The syntax of PML formulas is given by

$$F ::= \mathrm{tt} \mid \nabla_\alpha \mid \neg F \mid F_1 \wedge F_2 \mid \langle\alpha\rangle_\mu F$$

The models described in [18] are probabilistic, in that for any state $P$ and any
action $\alpha$, there is a (discrete) probability distribution over the $\alpha$-successors of $P$.
Informally, the semantics of a formula $\nabla_\alpha$ is the set of states unable to perform
an $\alpha$ activity; and the semantics of $\langle\alpha\rangle_\mu F$ is the set of states such that each
can make an $\alpha$-transition with probability at least $\mu$ to a set of successors each
of which satisfies $F$. We choose to modify slightly the interpretation of these
formulae with respect to PEPA models. First we give a simple definition:

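Definition 1 Let $S$ be a set of states. $P \xrightarrow{(\alpha,\nu)} S$ if and only if for all successors
$P' \in S$, $P \xrightarrow{(\alpha,r)} P'$, and $\sum \{\!|\, r : P \xrightarrow{(\alpha,r)} P',\ P' \in S \,|\!\} = \nu$.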

Now let P be a model of a PEPA process. Then 



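$$
\begin{array}{lll}
P \models \mathrm{tt} & & \\
P \models \neg F & \text{if} & P \not\models F \\
P \models F_1 \wedge F_2 & \text{if} & P \models F_1 \text{ and } P \models F_2 \\
P \models \nabla_\alpha & \text{if} & P \stackrel{\alpha}{\nrightarrow} \\
P \models \langle\alpha\rangle_\mu F & \text{if} & P \xrightarrow{(\alpha,\nu)} S \text{ for some } \nu \ge \mu, \text{ and for all } P' \in S,\ P' \models F.
\end{array}
$$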



Thus, the subscript $\mu$ present in formulae of the form $\langle\alpha\rangle_\mu F$ is now interpreted
as a rate rather than a probability; if a state $P$ is capable of doing activity $\alpha$
quickly enough, arriving at a set of states $S$ each of which satisfies $F$, then $P$
satisfies $\langle\alpha\rangle_\mu F$. For the remainder of the paper, we will denote PML with this
interpretation as PML$_\mu$.
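
The rate-based reading of the modal operator can be prototyped in a few lines. The sketch below is an illustrative checker over an explicit labelled transition system, not the algorithm of the PEPA Workbench; the tuple encoding of formulae, the function name `sat` and the example system are assumptions. For the diamond operator it relies on the observation that, since rates are positive, the set $S$ with the largest total rate is the set of all successors by that action which satisfy the subformula.

```python
def sat(lts, state, formula):
    """Decide whether `state` satisfies `formula` under the rate-based reading.

    lts maps each state to a list of (action, rate, successor) triples.
    Formulae are tuples: ("tt",), ("nabla", a), ("not", F), ("and", F, G),
    ("dia", a, mu, F).
    """
    kind = formula[0]
    if kind == "tt":
        return True
    if kind == "not":
        return not sat(lts, state, formula[1])
    if kind == "and":
        return sat(lts, state, formula[1]) and sat(lts, state, formula[2])
    if kind == "nabla":  # the state is unable to perform the action
        return all(a != formula[1] for a, _, _ in lts[state])
    if kind == "dia":
        _, action, mu, sub = formula
        # Total rate into the successors by `action` that satisfy `sub`.
        total = sum(rate for a, rate, nxt in lts[state]
                    if a == action and sat(lts, nxt, sub))
        return total >= mu
    raise ValueError("unknown formula: %r" % (formula,))


# Hypothetical system: transmission succeeds at rate 2.0 and fails at rate 1.0.
lts = {
    "Idle": [("transmit", 2.0, "Done"), ("transmit", 1.0, "Fail")],
    "Done": [("reset", 5.0, "Idle")],
    "Fail": [("repair", 0.5, "Idle")],
}
can_reset = ("not", ("nabla", "reset"))
reaches_reset_at_2 = sat(lts, "Idle", ("dia", "transmit", 2.0, can_reset))
reaches_reset_at_2_5 = sat(lts, "Idle", ("dia", "transmit", 2.5, can_reset))
```

Only the successor `Done` can perform `reset`, so the first query holds with total rate 2.0 and the second, which demands rate 2.5, does not.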



3.2 Relation of PML$_\mu$ to PEPA

In [18], Larsen and Skou show that PML exactly characterises probabilistic bisim- 
ulation, in the sense that two probabilistic processes are bisimilar if and only if 
they satisfy exactly the same set of PML formulae. With our modification to the 
semantics of PML, an analogous result holds for PEPA processes: 

Theorem 1 (Modal characterisation of strong equivalence) Let $P$ and $Q$ be
models of PEPA processes. Then

$$P \cong Q \text{ if and only if for all } F,\ P \models F \text{ if and only if } Q \models F$$

That is to say that two PEPA processes are strongly equivalent (in particular,
their underlying Markov chains are lumpably equivalent [4]) if and only if they
both satisfy the same set of PML$_\mu$ formulae (in our modified setting). A proof
of this can be found in [15].

This result is not just of theoretical interest. It guarantees that if a transformation
is applied to a model, resulting in the model being replaced by a
strongly equivalent model, then we can expect that the new model satisfies the 
same formulae as the original model. Moreover if rewards are attached to equiv- 
alent states then the performance measures derived from the new model will be 
equal to the measures which would have been derived from the original. 
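
The practical consequence can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not a PEPA model: it takes two independent on/off components with hypothetical rates `a` and `b`, writes down both the full four-state generator and the lumped three-state generator by hand, and attaches a behavioural reward (1 wherever some component can switch off) to each. The helper `steady_state` uses power iteration on the uniformised chain purely for brevity.

```python
def steady_state(Q, iterations=2000):
    """Approximate the equilibrium distribution of generator Q by power
    iteration on the uniformised chain P = I + Q / lam."""
    n = len(Q)
    lam = 1.1 * max(-Q[i][i] for i in range(n))
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [pi[j] + sum(pi[i] * Q[i][j] for i in range(n)) / lam
              for j in range(n)]
    return pi


a, b = 1.0, 2.0  # hypothetical switch-on and switch-off rates

# Two independent on/off components; states ordered (0,0), (0,1), (1,0), (1,1).
Q_full = [[-2 * a, a, a, 0.0],
          [b, -(a + b), 0.0, a],
          [b, 0.0, -(a + b), a],
          [0.0, b, b, -2 * b]]

# Aggregated chain that only counts how many components are on: 0, 1 or 2.
Q_lumped = [[-2 * a, 2 * a, 0.0],
            [b, -(a + b), a],
            [0.0, 2 * b, -2 * b]]

# Behavioural reward: 1 in every state that can perform a switch-off activity.
reward_full = [0.0, 1.0, 1.0, 1.0]
reward_lumped = [0.0, 1.0, 1.0]

measure_full = sum(p * r for p, r in zip(steady_state(Q_full), reward_full))
measure_lumped = sum(p * r for p, r in zip(steady_state(Q_lumped), reward_lumped))
```

Because the reward is defined by what a state can do and not by how the state is written, the two measures agree, whereas a reward tied to the syntactic state `(0,1)` would have no counterpart in the lumped chain.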

The automatic aggregation procedure described in [16] and implemented in 
the PEPA Workbench, is based on the isomorphism relation of PEPA. However 
this relation is stronger than strong equivalence, meaning that any isomorphic 
models are necessarily strongly equivalent. Thus the above result implies that, 
using PML$_\mu$ formulae, the measures calculated from a model after aggregation
will be identical to those that would have been calculated before. Therefore, from 
the user’s point of view the aggregation remains transparent even when reward 
calculations are to be carried out. This is not the case for the other reward 
specification techniques used in SPA models. 

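Some PML$_\mu$ derived combinators are introduced in Equation 1. These add
no expressive power to the logic, but will prove more succinct in expressing
particular properties later. Informally, $[\alpha]_\mu F$ is the set of processes which can
make an $\alpha$-transition with rate at least $\mu$, the derivative of which must satisfy
$F$, and $\Delta_\alpha$ is those processes which are able to perform an $\alpha$ activity.

$$
\begin{aligned}
\mathrm{ff} &= \neg\mathrm{tt} & [\alpha]_\mu F &= \neg\langle\alpha\rangle_\mu \neg F \\
F_1 \vee F_2 &= \neg(\neg F_1 \wedge \neg F_2) & \Delta_\alpha &= \neg\nabla_\alpha
\end{aligned}
\qquad (1)
$$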
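
That the derived combinators add no expressive power can be made explicit by treating them as macros that expand into the primitive syntax. The sketch below uses a hypothetical tuple encoding of formulae; the function names are illustrative, and the expansion of the "able to perform" combinator as the negation of the "unable to perform" primitive is an assumption read off from the informal description above.

```python
TT = ("tt",)


def neg(f):
    return ("not", f)


def conj(f, g):
    return ("and", f, g)


def nabla(action):
    return ("nabla", action)


def dia(action, mu, f):
    return ("dia", action, mu, f)


# Derived forms, each defined purely in terms of the primitives above.
FF = neg(TT)


def box(action, mu, f):
    return neg(dia(action, mu, neg(f)))


def disj(f, g):
    return neg(conj(neg(f), neg(g)))


def delta(action):
    return neg(nabla(action))


# Example: "able to perform faulty" expands to a primitive formula.
can_fault = delta("faulty")
```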

When specifying some performance measures it is natural to use the idea of model 
states, as well as model behaviour in a state. This can be smoothly reconciled 
with the use of a probabilistic logic, and the computation of the reward vector 
can thus be seen as a two-stage procedure. The method is simple, and standard 
in the theory of process logics: it is to extend the syntax of PML$_\mu$ with a set of
variables $V$, and for a given model $P$ with state space $S$, to extend the semantics
with a valuation function $\mathcal{V} : V \to 2^S$.

$$F ::= \mathrm{tt} \mid \nabla_\alpha \mid \neg F \mid F_1 \wedge F_2 \mid \langle\alpha\rangle_\mu F \mid X$$

$$P \models X \text{ if and only if } P \in \mathcal{V}(X)$$

The intuition is that a variable $X \in V$ represents a property which is true in
a particular subset of the state space. This allows the expression of formulae
such as $\neg(\langle \mathit{transmit} \rangle_{120}\, \mathit{FailState})$, where FailState is understood to represent
an undesirable portion of the state space: “it is not the case that it is possible
to efficiently transmit a network packet and finish in a failure state”. We have
found that it is useful to compose an additional monitor process with the model,
and use this to label states. This is a well-known technique in verification but,
in general, it relies on the skill of the modeller to design such a process so that
it does not alter the state space of the original model. 

Due to the predicative semantics of PML$_\mu$, i.e. formulae evaluate to a characteristic
function over the set of states, it is straightforward to specify utilisation
and reliability measures which only require reward values of 0 or 1. The relation
of PML$_\mu$ to measures such as throughput, which require real-valued rewards, is
less direct. For this reason PML$_\mu$ is not, in general, used alone, but as part of
the richer PEPA Reward Language. 

3.3 PEPA Reward Language 

The definition of a reward structure in the PEPA Reward Language is comprised 
of two parts: 

— a reward specification, which associates a value with a logical formula, spec- 
ifying a behaviour; 

— an attachment which determines with which process derivatives a particular 
reward specification is associated, reflecting the compositional structure of 
the model. 

The meaning of the reward specification will depend on how it is “attached” to 
a PEPA model — the associated value may depend on information local to the 
derivative under consideration. This will be explained when the semantics of the 
reward language is described below. 

Formally, each reward specification can be considered as a pair consisting of a 
logical formula and a reward expression. Following the attachment, the formula 
is checked against a set of subcomponents within the context of the model as 
a whole. When the formula is satisfied the corresponding derivative is assigned 
a reward. The value of the reward corresponds to the evaluation of a simple 
arithmetic expression. 



Syntax and Semantics of Reward Expressions. The syntax of reward 
expressions, given below, is very simple; indeed, it captures little more than a 
straightforward syntax for arithmetic. The only additions to this are three bound 
variables. 



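$$e ::= (e) \mid e_1 + e_2 \mid e_1 - e_2 \mid e_1 \times e_2 \mid e_1 / e_2 \mid \mathit{atom}$$
$$\mathit{atom} ::= r \in \mathbb{R} \mid \mathit{cur} \mid \mathit{rate}(\alpha \in \mathit{Act})$$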

The bound variables cur and rate() will be used to denote real numbers. The
meaning will be dependent on the reward structure being built, and the particular
labelled multi-transition system which results from the PEPA model under
consideration. They exist for pragmatic reasons: they are useful in specifying
performance measures. The variable cur is intended to give the reward expression
access to a “currently” assigned reward, allowing reward expressions to make use
of previous assignments. The variable rate() allows activity rates to be used in
expressions; specifically, reward values can be assigned to a derivative $P$ which
make use of the transition rate from $P$ to successor derivatives via an activity of
type $\alpha$. This is the way in which timing information may be incorporated into
reward specifications.

The objective is to define a reward function $\rho$, such that if $P$ is the PEPA process
under consideration, and $ds$ computes its derivative set, then $\rho : ds(P) \to \mathbb{R}$.
That is, given a derivative (in fact a state of the transition system), $\rho$ gives the
reward assigned to that derivative. The complete reward structure may be built up
by successively overlaying the effects of different reward functions; this explains
the inclusion of the variable cur.

Given this reward assignment function, the semantic function relies on the 
context c of a PEPA process P to define the meaning of reward expressions; the 
semantics are given in Figure 1. 

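$$
\begin{aligned}
\|(e)\|_{P_c} &= \|e\|_{P_c} \\
\|e_1 \;\mathit{op}\; e_2\|_{P_c} &= \|e_1\|_{P_c} \;\mathit{op}\; \|e_2\|_{P_c} \\
\|r\|_{P_c} &= r \\
\|\mathit{cur}\|_{P_c} &= \rho(c[P]) \\
\|\mathit{rate}(\alpha)\|_{P_c} &= \sum \{\, r : c[P] \xrightarrow{(\alpha, r)} \,\}
\end{aligned}
$$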



Fig. 1. Semantics of reward expressions 



The notation $\|e\|_{P_c}$ denotes the evaluation of expression $e$ with respect to $P$
in the context c; c[P] denotes P in the context c and the binary operator op is 
intended to capture the obvious binary operators defined in the syntax above. 
The following definition completes the definition of a reward specification. 

Definition 2 A reward specification is a pair $(F, e)$, where $F$ is a PML$_\mu$ formula
and $e$ is a reward expression.
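
The semantics of reward expressions amounts to a small recursive evaluator. The following sketch is illustrative only: expressions are encoded as hypothetical tuples, a state of an explicit transition system plays the role of the derivative in its context, and the dictionary `rho` holds the rewards assigned so far. Parentheses need no case of their own because grouping is already captured by the tree structure.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def evaluate(expr, state, lts, rho):
    """Evaluate a reward expression at `state`.

    expr is ("num", r), ("cur",), ("rate", action) or ("op", symbol, e1, e2);
    lts maps states to lists of (action, rate, successor) triples.
    """
    kind = expr[0]
    if kind == "num":
        return expr[1]
    if kind == "cur":  # reward currently assigned to this state
        return rho.get(state, 0.0)
    if kind == "rate":  # total rate of the named activity out of this state
        return sum(rate for a, rate, _ in lts[state] if a == expr[1])
    if kind == "op":
        _, symbol, left, right = expr
        return OPS[symbol](evaluate(left, state, lts, rho),
                           evaluate(right, state, lts, rho))
    raise ValueError("unknown expression: %r" % (expr,))


# Hypothetical state with two "serve" transitions and one "fail" transition.
lts = {"Busy": [("serve", 3.0, "Idle"), ("serve", 1.0, "Idle"),
                ("fail", 0.1, "Down")],
       "Idle": [], "Down": []}
rho = {"Busy": 2.0}

# cur + 2 * rate(serve)
expr = ("op", "+", ("cur",), ("op", "*", ("num", 2.0), ("rate", "serve")))
value = evaluate(expr, "Busy", lts, rho)
```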



Creating a Reward Structure with Attachments. When the behavioural 
specification captured by the reward specification relates to subcomponents 
within a model, rather than to the model as a whole, an attachment may be 
used to guide how the formula is to be tested against the model. For instance, 
given a large PEPA model, it may be interesting to only examine the performance 
of a single component queue. It should be possible to disregard the behaviour 
of the rest of the model, at least up to its interaction with the queue under 
examination. To achieve this, contexts are employed. 

Definition 3 An attachment, $a$, is a triple $(\sigma, c, (P_1, \ldots, P_n))$, where $\sigma$ is a
reward specification, $c$ is a context, and $P_i$ are PEPA processes, for $1 \le i \le n$.



¹ If $\alpha$ is passive in $P$, rate() is undefined.

The attachment allows the modeller to choose which subcomponents are of
interest; the subcomponents are the processes $P_i$.

Now let $\rho : ds(P) \to \mathbb{R}$ represent a function constructing a reward structure,
and let $P$ be a PEPA process. Assume an initial value of $\rho(P') = 0$, for all
$P' \in ds(P)$. The semantic function for attachments takes as an argument a
reward assignment function $\rho$ and evaluates to a new function, say $\rho'$. This
possibly modified assignment function will reflect any new rewards that have
been assigned to the PEPA model. Its argument is a sequence of attachments. A
sequence is chosen so a reward structure can be built sequentially, allowing one
reward expression to make use of the values present in the partially constructed
reward structure. Evaluating such a sequence of attachments is trivial: each is
evaluated individually, in order. This is shown below.



$$\|()\|_\rho = \rho$$

$$\|(a_i, a_{i+1}, \ldots, a_m)\|_\rho = \|(a_{i+1}, a_{i+2}, \ldots, a_m)\|_{\rho'} \quad \text{where } \rho' = \|a_i\|_\rho$$

The meaning of an attachment can now be defined.

Definition 4 (Semantics of an attachment) The meaning of an attachment
$a_i = ((F, e), c, (P_1, \ldots, P_n))$ is a value determined as follows:

$$\|a_i\|_\rho = \begin{cases} \|e\|_{c[P_1,\ldots,P_n]} & \text{if } (P_1, \ldots, P_n) \models F \\ 0 & \text{otherwise.} \end{cases}$$

p' is created by ordinary function perturbation and the end result is a function 
which constructs a reward structure over the derivative space of a PEPA process. 
More details can be found in [20]. 
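
The sequential construction can be sketched as a fold over the list of attachments. In the illustration below, contexts are ignored, a Python predicate stands in for checking the formula and a Python function stands in for evaluating the reward expression; all names are hypothetical. The sketch also assumes one particular reading of function perturbation, namely that only the states satisfying the formula are updated while all others keep their current reward.

```python
def apply_attachment(rho, states, satisfies, reward):
    """Return a copy of rho perturbed at every state that satisfies the formula.

    reward(state, rho) may consult the current assignment, which mirrors
    the role of the bound variable cur.
    """
    new_rho = dict(rho)
    for s in states:
        if satisfies(s):
            new_rho[s] = reward(s, rho)
    return new_rho


def apply_sequence(rho, states, attachments):
    """Evaluate the attachments one by one, threading the reward function."""
    for satisfies, reward in attachments:
        rho = apply_attachment(rho, states, satisfies, reward)
    return rho


states = ["Idle", "Busy", "Down"]
initial = {s: 0.0 for s in states}  # every reward starts at 0

attachments = [
    # Reward 1 for every operational state.
    (lambda s: s != "Down", lambda s, rho: 1.0),
    # Then add 2 to whatever the busy state currently holds.
    (lambda s: s == "Busy", lambda s, rho: rho[s] + 2.0),
]
final = apply_sequence(initial, states, attachments)
```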

4 Implementation 

The PEPA Workbench has been extended to allow the use of a subset of the 
PEPA Reward Language. This allows the modeller to express behavioural properties
using PML$_\mu$, though currently the use of contexts is not implemented. The
implementation automatically generates a reward structure which provably gen- 
erates the same performance measures for any two strongly equivalent models. 
This means that the modeller may apply aggregation to a PEPA model without 
having to alter the description of any performance measures. 

Given a PEPA model, the Workbench generates a representation of the 
model’s generator matrix which is then solved. In order to generate the ma- 
trix, it is necessary for the Workbench to traverse the entire state space of the 
PEPA model. After this traversal, for each state of the model, a reward specifi- 
cation can be checked, and if satisfied, a reward assigned. The algorithm used to 
implement this subset of the reward language employs a simple model checking 
procedure for PML$_\mu$.
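
The overall pipeline (traverse the state space, build the generator matrix, then test each state and assign a reward) can be sketched as follows. This is not the Workbench implementation: the function names are hypothetical, the model is a made-up queue of capacity two, and a simple "can serve" test stands in for model checking a formula.

```python
from collections import deque


def explore(initial, successors):
    """Breadth-first traversal; returns the states in discovery order and
    a map from each state to its index."""
    index = {initial: 0}
    order = [initial]
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for _, _, nxt in successors(state):
            if nxt not in index:
                index[nxt] = len(order)
                order.append(nxt)
                queue.append(nxt)
    return order, index


def generator(order, index, successors):
    """Build the generator matrix: off-diagonal entries are transition rates
    and each diagonal entry is minus the sum of the rest of its row."""
    n = len(order)
    Q = [[0.0] * n for _ in range(n)]
    for state in order:
        i = index[state]
        for _, rate, nxt in successors(state):
            j = index[nxt]
            if j != i:  # self-loops do not contribute to the generator
                Q[i][j] += rate
                Q[i][i] -= rate
    return Q


def successors(n):
    """Hypothetical queue of capacity 2: arrivals at rate 1.0, service at 2.0."""
    out = []
    if n < 2:
        out.append(("arrive", 1.0, n + 1))
    if n > 0:
        out.append(("serve", 2.0, n - 1))
    return out


order, index = explore(0, successors)
Q = generator(order, index, successors)
# After the traversal, test every state and assign a reward where it passes.
rewards = [1.0 if any(a == "serve" for a, _, _ in successors(s)) else 0.0
           for s in order]
```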

5 Case Study: Self-Checking Distributed Computation

Our case study comes from the TIRAN project (“TaIlorable fault toleRANce
frameworks for embedded applications”, IT Project 28620). In this project ex- 
ample fault-tolerant applications are provided by industrial partners (Siemens 
and ENEL). The objective of the project is to build a modular framework in 
which the faults, errors and failures can be methodically considered. One mod- 
ule in the framework is the TIRAN backbone, responsible for tolerating internal 
and external faults. The backbone is currently under implementation and here 
we present a simplification of one of the key algorithms which is used. 

The setting is an embedded system made up of a number of loosely cou- 
pled nodes without shared memory. Communication is by synchronous message 
passing. Agents run on each of these nodes. One of the agents is designated 
the manager of the others. In our simplified model, we assume that the man- 
ager never experiences failures. Without this simplifying assumption it would be 
necessary to describe a leadership election in addition to the processing which 
we describe here. We concentrate on the internal faults which are detected in 
the self-checking phase of the distributed computation. During self-checking the 
agents monitor their own progress and may declare themselves to be faulty. 

The manager periodically broadcasts a query to all of the agents which asks 
“are you alive?” . The manager waits to receive one of two possible replies: the 
agent responds “this agent is alive” (alive) or “this agent is faulty” (faulty). 
The latter means that the agent has detected and trapped a fault and is in an 
uncertain state. If no reply is received within a certain timeout the manager 
assumes that there is a hardware fault on the node on which the agent was 
running. In this case a recovery process is initiated for the node. 

Each agent is composed of three sub-processes. These are responsible for fault 
detection, isolation and restarting. The detector sets a local flag to indicate that 
there is no fault at present. It then reports back to the manager that this agent is 
alive. The same flag is periodically unset by the isolation process. If the isolator 
process later reads the flag and finds it still unset then it reports back to the 
manager that this agent is faulty. If the agent is faulty, the restart subprocess 
ensures the correct re-initialisation of the agent. 

We reflect the structure of the system in the components which are used in 
the model description. We define a Manager component, and generic Agent com- 
ponents. The manager controls Daemon processes, one for each agent which it 
must manage. Within an agent, we have sub-components for detection, isolation 
and restarting. Since we work with exponential assumptions, the deterministi- 
cally timed timeouts are here approximated by exponential distributions. 

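$$
\begin{aligned}
\mathit{Manager} &\stackrel{\mathrm{def}}{=} (\mathit{query}, q).\mathit{Manager} \\
\mathit{Daemon}_0 &\stackrel{\mathrm{def}}{=} (\mathit{query}, \top).\mathit{Daemon}_1 \\
\mathit{Daemon}_1 &\stackrel{\mathrm{def}}{=} (\mathit{alive}, \top).\mathit{Daemon}_0 + (\mathit{faulty}, \top).\mathit{Daemon}_2 + (\mathit{timeout}, t_1).\mathit{Daemon}_3 \\
\mathit{Daemon}_2 &\stackrel{\mathrm{def}}{=} (\mathit{restartAgent}, \top).\mathit{Daemon}_0 + (\mathit{timeout}, t_2).\mathit{Daemon}_3 \\
\mathit{Daemon}_3 &\stackrel{\mathrm{def}}{=} (\mathit{repairNode}, rn).\mathit{Daemon}_0
\end{aligned}
$$

The agent composes sub-processes, synchronising on the relevant activities.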

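$$\mathit{Agent} \stackrel{\mathrm{def}}{=} (\mathit{Detector}_0 \bowtie_{L_1} \mathit{Isolator}_0) \bowtie_{L_2} \mathit{Restart}_0$$

where $L_1 = \{\mathit{timeout}, \mathit{flag}_0, \mathit{flag}_1, \mathit{restartAgent}, \mathit{repairNode}\}$

$L_2 = \{\mathit{timeout}, \mathit{alive}, \mathit{faulty}, \mathit{restartAgent}, \mathit{repairNode}\}$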

The detector process is concerned with the value of the flag and with reporting 
that the agent is still alive. It registers any timeouts which occur and witnesses 
the recovery of the agent or the node. 

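$$
\begin{aligned}
\mathit{Detector}_0 &\stackrel{\mathrm{def}}{=} (\mathit{flag}_1, f_1).\mathit{Detector}_1 + (\mathit{flag}_0, \top).\mathit{Detector}_2 + (\mathit{timeout}, \top).\mathit{Detector}_3 \\
\mathit{Detector}_1 &\stackrel{\mathrm{def}}{=} (\mathit{alive}, a).\mathit{Detector}_0 + (\mathit{timeout}, \top).\mathit{Detector}_3 \\
\mathit{Detector}_2 &\stackrel{\mathrm{def}}{=} (\mathit{restartAgent}, \top).\mathit{Detector}_0 + (\mathit{timeout}, \top).\mathit{Detector}_3 \\
\mathit{Detector}_3 &\stackrel{\mathrm{def}}{=} (\mathit{repairNode}, \top).\mathit{Detector}_0
\end{aligned}
$$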

The isolator process receives the “are you alive?” query from the manager. After 
checking the value of the flag it has the responsibility of sending the fault report, 
if this is appropriate. As with the detector process it registers timeouts and any 
recovery processes. 

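$$
\begin{aligned}
\mathit{Isolator}_0 &\stackrel{\mathrm{def}}{=} (\mathit{query}, \top).\mathit{Isolator}_1 \\
\mathit{Isolator}_1 &\stackrel{\mathrm{def}}{=} (\mathit{flag}_1, \top).\mathit{Isolator}_0 + (\mathit{flag}_0, f_0).\mathit{Isolator}_2 + (\mathit{timeout}, \top).\mathit{Isolator}_4 \\
\mathit{Isolator}_2 &\stackrel{\mathrm{def}}{=} (\mathit{faulty}, f).\mathit{Isolator}_3 + (\mathit{timeout}, \top).\mathit{Isolator}_4 \\
\mathit{Isolator}_3 &\stackrel{\mathrm{def}}{=} (\mathit{restartAgent}, \top).\mathit{Isolator}_0 + (\mathit{timeout}, \top).\mathit{Isolator}_4 \\
\mathit{Isolator}_4 &\stackrel{\mathrm{def}}{=} (\mathit{repairNode}, \top).\mathit{Isolator}_0
\end{aligned}
$$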

The final process is responsible for restarting the agent after a fault. It must 
witness the other reports from the other sub-processes in order that it does not 
restart the agent when this action is not necessary. 

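$$
\begin{aligned}
\mathit{Restart}_0 &\stackrel{\mathrm{def}}{=} (\mathit{alive}, \top).\mathit{Restart}_0 + (\mathit{faulty}, \top).\mathit{Restart}_1 + (\mathit{timeout}, \top).\mathit{Restart}_2 \\
\mathit{Restart}_1 &\stackrel{\mathrm{def}}{=} (\mathit{restartAgent}, ra).\mathit{Restart}_0 + (\mathit{timeout}, \top).\mathit{Restart}_2 \\
\mathit{Restart}_2 &\stackrel{\mathrm{def}}{=} (\mathit{repairNode}, \top).\mathit{Restart}_0
\end{aligned}
$$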

The complete system is configured as a composition of agents paired with an 
instance of the daemon. This structure in the model represents a single manager 
process in the system with channels to communicate with each of the agents 
running on the remote nodes. In order to model a broadcast of the “are you 
alive?” message, all of these paired agent and manager processes are required to 
synchronise when the message is sent. The replies are received asynchronously by 
each instance of the daemon and all must be received before another broadcast 
can be sent. In the case of a fault of an agent or a node the restart action must
be performed on that node before the next broadcast can be sent. Below we
show the system with two managed agents. 

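$$\mathit{ManagedAgent} \stackrel{\mathrm{def}}{=} \mathit{Agent} \bowtie_L \mathit{Daemon}_0$$

$$\mathit{System} \stackrel{\mathrm{def}}{=} (\mathit{ManagedAgent} \bowtie_{\{\mathit{query}\}} \mathit{ManagedAgent}) \bowtie_{\{\mathit{query}\}} \mathit{Manager}$$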

The synchronisation set L between the agent and the associated daemon contains 
the activity types query, alive, faulty, timeout, restartAgent, and repairNode. 

As further managed agents are added the state space of the system increases 
proportionally. Of course, the model exhibits symmetries which can be exploited 
in order to eliminate uninteresting variants of states. The aggregated state space 
is more compact to store and more efficient to solve in order to find the steady 
state probability distribution of the system. This aggregation is performed au- 
tomatically using the PEPA Workbench. 
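
A rough feel for what symmetry exploitation can save is given by a simple counting argument. The sketch below assumes $n$ identical, interchangeable copies of a component with $k$ local states and ignores any synchronisation constraints that make some combinations unreachable, so both figures are upper bounds under those assumptions and not state counts of the model above. The function names are illustrative.

```python
from math import comb


def ordered_states(k, n):
    """Upper bound when the n copies are distinguished by position: k ** n."""
    return k ** n


def aggregated_states(k, n):
    """Upper bound when only the multiset of local states matters:
    the number of multisets of size n over k local states."""
    return comb(n + k - 1, n)


# Hypothetical component with 6 local states, replicated 2 and 4 times.
pairs = (ordered_states(6, 2), aggregated_states(6, 2))
quads = (ordered_states(6, 4), aggregated_states(6, 4))
```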



5.1 Investigations into the Model 

The effect of aggregation has been to eliminate some syntactic presentations of 
states. It may perhaps be the case that the states are still present, but that their 
presentation has been altered. This can be a distinct disadvantage to the mod- 
eller unless an expressive language is also provided which allows the definition of 
performance measures over the model without reference to the syntactic presen- 
tation of states. Without such a language, the modeller must have a thorough 
knowledge of the aggregation method deployed in order to identify the syntactic 
presentations which remain after aggregation. Even with complete knowledge of 
the aggregation method used, the definition of performance measures would still 
be made more awkward and unnatural for the modeller. 

The logical notation which we provide for expressing performance measures 
avoids the need for the modeller to understand the aggregation method employed 
by the tool, and to reproduce its effect on the states of interest. The same 
definition of a measure can be used without change, whether or not aggregation 
is performed. Further, the definition of measures which we use is independent 
of the number of agents in the system, and can also be re-used unchanged for 
larger configurations of the system. 

To illustrate the use of the PEPA Reward Language, we compute here two 
significant performance measures for the system. The first is the potential for lag 
in the system. This occurs when one of the agents has detected a local fault but 
its appointed daemon has not yet registered this information. This can be simply 
expressed as A faulty. This captures all of the states which have the potential to 
perform a transition faulty. A reward of $f$ (the fault registration rate) is assigned 
to these states. 
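
A minimal sketch of how such a reward structure yields a measure, assuming that the steady state distribution is already known and that each state lists the activity types it enables. The state names, probabilities and the value used for the reward are invented for illustration only.

```python
def reward_measure(steady_state, enabled, activity, reward):
    """Sum of pi(s) * reward over the states s that enable `activity`."""
    return sum(
        prob * reward
        for state, prob in steady_state.items()
        if activity in enabled[state]
    )


if __name__ == "__main__":
    # Hypothetical three-state example, not derived from the model in the text.
    pi = {"s0": 0.5, "s1": 0.3, "s2": 0.2}
    enabled = {"s0": {"alive"}, "s1": {"faulty", "timeout"}, "s2": {"faulty"}}
    f = 2.0  # assumed fault registration rate
    print(reward_measure(pi, enabled, "faulty", f))
```

Because the selection of states refers only to the activities they enable, the same definition applies unchanged to an aggregated or an unaggregated state space.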

The effect of varying the flag registration rates, $f_0$ and $f_1$, on this measure 
is shown in Figure 2. As might be expected, the measure is more sensitive to 
variations in $f_0$ since this is the rate associated with the $\mathit{flag}_0$ activity, which 
leads to the states where the faulty message can be sent. In Figure 3 we show 
that the effect of timing the agents out at a faster rate is to pre-empt fault 
registration, causing agents to be restarted by the manager more often. 

Fig. 2. Plot of lag while varying rates $f_0$ and $f_1$ 

Fig. 3. Plot of lag while varying rates $f_0$ and $t_1$ 



The second measure which we define captures the potential for successive 
timeouts. This measure would be of interest if tuning the system to find the 
right balance between waiting for faulty agents to self-repair or pre-empting 
them by software or hardware resets. 

Timeout activities are distinguished in this model because they are the only 
activities performed at more than one rate ($t_1$ and $t_2$). We express our 
measure with a third variable, $t$. 

$\langle \mathit{timeout} \rangle_t \, \langle \mathit{timeout} \rangle_t \, \mathit{tt}$ 

By varying the relative values of $t$, $t_1$ and $t_2$, this measure will include one, two 
or none of the classes of timeout activity. Thus the ability to use rate variables in 
reward expressions allows us to include just the activities which are of interest. 
6 Related Work 

We have presented the PEPA Reward Language, a notation for the description 
of performance specifications which relate to stochastic process algebra models 
expressed in PEPA. The study of a self-checking distributed computation illus- 
trates the way in which a modeller would use the PEPA Reward Language to 
reason about the performance of a model. 

An alternative approach to constructing reward structures over SPA models 
is presented in [21]. In that paper, Bernardo extends the syntax of EMPA, so that 
each activity is augmented with a reward value, i.e. each activity is represented 
by a triple comprised of type, rate and reward values. In the generated reward 
structure, the reward assigned to each state is the sum of the rewards associated 
with activities enabled in that state. Bernardo has constructed an equivalence 
relation which respects the additional reward information. 

Our choice of PML was motivated by its simplicity, and its link to PEPA’s 
strong equivalence. Other research in the area of probabilistic verification has 
links to our approach. Logics such as that presented by Hansson and Jons- 
son [9] are able to specify bounds on probabilistic properties, but crucially these 
are probabilities over behaviours from a specified state. Recent work by de Al- 
faro [12] addresses the problem of specifying “long-run” average properties of 
probabilistic systems, with non-deterministic choices made by adversaries. De 
Alfaro defines experiments to represent interesting model behaviour patterns. 
These experiments associate a real-valued outcome with a pattern of behaviour, 
and are considered to occur infinitely often. In [22] Baier et al. describe how 
a temporal logic can be used to specify transient properties of CTMCs. Their 
model checking procedure aims to establish whether such properties hold or not. 
This is quite distinct from our use of logic to construct the reward structure used 
to calculate steady state performance measures. 
Abstract. As hardware becomes faster and bandwidth greater, the de- 
termination of the performance of software based systems during de- 
sign, known as Software Performance Engineering (SPE), is a growing 
concern. A recent seminar of experts at Dagstuhl and the First Interna- 
tional Workshop on Software and Performance have both highlighted the 
need to bring performance evaluation into the software design process. 
The Unified Modelling Language (UML) has emerged in the last two 
years as a widely accepted standard notation for software design and it 
is an attractive vehicle for SPE. In this paper UML’s Collaboration and 
Statechart diagrams are shown to allow systematic generation of Gener- 
alised Stochastic Petri Net (GSPN) models, which can be solved to find 
their throughput and other performance measures. Using the example 
of communication via the alternating bit protocol, such a mapping is 
demonstrated and the resulting GSPN solved using the SPNP package. 
The basis of a usable methodology for SPE is explored. 



1 Introduction 

The hardware in computers and communication networks is becoming faster and 
its offered bandwidth continues to increase. As a result the software is increas- 
ingly seen as the bottleneck in such systems. It is widely accepted that at least 
eighty percent of the performance of a system is determined by its general ar- 
chitecture and that this is as true of software as it is of hardware. This makes it 
essential that performance can be analysed from the earliest stages in a design. 

In the last two years, the software design community has embraced a move 
towards a new design notation, intended to provide a common vocabulary for 
software based systems. This Unified Modelling Language (UML) is gaining 
widespread acceptance and this has made it possible to focus on a single 
notation. Since the notation includes both static and dynamic aspects of systems it 
is very well suited to generating performance results, although additional infor- 
mation on timings and branching probabilities is needed. Here we consider an 
example of using UML to generate Petri net models for performance prediction 
in a communication network. 
The rest of this paper is structured as follows. Section 2 describes the 
application modelled, a simple network using the alternating bit protocol (ABP), 
and considers how to model this application in UML. Section 3 looks briefly at 
how UML models have been used in performance analysis to date. Section 4 
uses the UML model to generate a Generalised Stochastic Petri Net (GSPN) [2] 
model of the alternating bit protocol and describes its solution using SPNP [2]. 
Section 5 summarises the numerical results, with graphs of throughput against 
timeout interval. Finally, Section 7 draws conclusions from this work and suggests 
the way forward. 

2 Modelling ABP in the Unified Modelling Language 

2.1 The Alternating Bit Protocol 

The alternating bit protocol (ABP) is the simplest known protocol for reliable 
communication between two nodes. It uses flow control with a window size of 
one, which can be encoded as a single bit in any packet. 

A transmitter first sends a packet with the sequence bit set to zero. It then 
waits until either it receives a corresponding ack packet from the receiver or 
until the defined timeout period has been exceeded, when it assumes the packet 
is lost and retransmits. 

Once an ack is received the transmitter can send its next packet, with the 
bit set to one. It again awaits the corresponding ack, retransmitting after each 
timeout period. 

This continues, with the sequence bit alternating on successive packets. 

This protocol has been formally verified, for instance by Milner [13]. Its 
performance has been studied several times, including by the use of the TIPP 
stochastic process algebra [6]. 
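
The behaviour described above can be sketched as a short simulation. This is an illustrative model under simplifying assumptions, not the model analysed in this paper: losses are independent with a fixed probability, a timeout occurs only after a packet or its ack has actually been lost, and the function and parameter names are hypothetical.

```python
import random


def abp_transfer(messages, loss_prob=0.2, seed=0):
    """Deliver `messages` in order over a lossy channel using a one-bit sequence number."""
    rng = random.Random(seed)
    delivered = []
    send_bit = 0      # sequence bit of the packet the sender is working on
    expect_bit = 0    # sequence bit the receiver is waiting for
    transmissions = 0
    for payload in messages:
        acked = False
        while not acked:
            transmissions += 1
            if rng.random() < loss_prob:
                continue  # data packet lost: sender times out and retransmits
            if send_bit == expect_bit:
                delivered.append(payload)  # new packet: accept it
                expect_bit ^= 1
            # Otherwise it is a duplicate, which is acknowledged but not delivered.
            if rng.random() < loss_prob:
                continue  # ack lost: sender times out and retransmits
            acked = True
        send_bit ^= 1  # alternate the bit for the next packet
    return delivered, transmissions


if __name__ == "__main__":
    data = ["a", "b", "c", "d"]
    received, sent = abp_transfer(data, loss_prob=0.3, seed=1)
    print(received, sent)
```

The duplicate check on the receiver side is what makes a lost ack harmless: the retransmitted packet carries the old bit and is acknowledged again without being delivered twice.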

2.2 The Unified Modelling Language 

The Unified Modelling Language (UML) is a graphically based notation, 
which is being developed by the Object Management Group as a standard means 
of describing software oriented designs. It contains several different types of 
diagram, which allow different aspects and properties of a system design to be 
expressed. Diagrams must be supplemented by textual and other descriptions 
to produce complete models. For example, a use case is really the description of 
what lies inside the ovals of a use case diagram, rather than just the diagram 
itself. 

More recently there has been interest in exploiting at least some parts of the 
UML for real time and embedded systems design and hardware design. These 
have focussed on the interaction and statechart views within the notation. We 
have found, similarly, that these views are most appropriate to the modelling of 
communications protocols, but we also believe that use case models are impor- 
tant when considering how such protocols will be used to meet user needs and 
in defining the workloads to be expected by systems. 
Use Case Models. In recent years the most radical addition to the first gener- 
ation of object oriented modelling approaches is use case modelling. This shows a 
system in terms of the external users of the system, known as actors and usually 
shown as “stick people”. An actor may be a system, not necessarily a person. 

Each actor is shown as participating in one or more use cases, shown as 
ovals linked to the participating actors. A use case is some high level activity or 
capability of the system. Use case diagrams are significant in showing external 
systems and users that initiate functions. 



Figure 1 is an example of a use case diagram showing a communications net- 
work from the user perspective. Each use case represents one class of application 
which uses the underlying protocols to communicate. For simplicity we assume 
only point to point communication is required. 

The Classes Used. For the purposes of this paper, we assume that classes 
and objects exist as the fundamental units of description within a design. In 
particular, classes encapsulate behaviour, which can be described by a state 
machine description. 

In the ABP example we identify the classes shown in Figure 2. Node is 
the basic component which handles communications to and from one point in 
the network. It has a Sender and a Receiver, which are shown as permanent 
components within it. Channel is the logical communications medium. Each 
channel is associated with one sender and one receiver. It supports the operation 
send. ActiveProcess is a generalisation of Sender and Receiver. It requires 
all its subclasses to provide an operation receive. Its specialisations differ in 
the rest of their behaviour. They support the operations pSend for Sender and 
forward for Receiver. Sender is the sending process in a node. It is shown as a 
specialisation of ActiveProcess. Receiver is the receiving process of a Node. It 
too is shown as a specialisation of ActiveProcess. 

Fig. 1. Use case diagram for nodes communicating 

Fig. 2. Class diagram for nodes communicating 

Sequence diagrams. Sequence diagrams are based on message sequence 
charts [1]. Objects are shown as boxes with dashed lines extending vertically 
below them. As these lines move down the page, they represent the passing 
of time. Arrows across the page show the sequences of messages passed be- 
tween objects as time passes. Periods of activity by an object may be shown 
as a bar, called an activation, overlying its dashed life line. 

Each sequence diagram represents one or more routes through the unfolding 
of a use case (high level) or of an operation in a class (low level) . If a single 
route is shown, one particular set of conditions is being assumed. Such a set 
of conditions is termed a scenario. 

The example in Figure 3 shows a successful transmission scenario with the 
sequence bit set to zero and no timeouts. The messages are all asynchronous. 

Fig. 3. A sequence diagram showing a scenario in the ABP 

Collaboration diagrams. Here time is not represented implicitly. Instead 
messages are numbered explicitly and the visual emphasis is on showing which 
objects communicate with which others. Messages are numbered to show 
the order in which they should happen. Causal nesting can be shown using a 
decimal numbering scheme. Partial orders, representing concurrency, can be 
shown by using names instead of numbers at the appropriate level. 

The example in Figure 4 shows the same scenario as the sequence diagram in 
Figure 3. 

Fig. 4. The ABP modelled in a UML collaboration 



State and Activity Diagrams. UML defines state diagrams, which allow a 
class to be defined in terms of the states it can be in and the events which cause 
it to move between states. 
They are essentially Harel Statecharts. These derive from conventional 
state transition diagrams, with the additional features that: 

— a state may indicate that the object is engaged in some activity; 

— transitions between states can be due to messages or to changes in certain 
conditions or to a combination of these; 

— states can be nested within super-states. 

Harel Statecharts predate UML and form a very rich modelling formalism in 
themselves. We have only explored a simple subset of their features so far in our 
work. This example illustrates the most straightforward possibilities. 




Fig. 5. A state diagram 



The example in Figure 5 shows the internal behaviour of a Sender. This 
should correspond to the behaviour of all interactions involving instances of this 
class. 

The transitions shown are triggered either by incoming messages, such as the 
deliver(ack, 1) message which triggers the transition from state Sent 1 to state 
Can Send 0, or by the passing of time, such as the after(t) indicating the timeout 
which triggers transitions from both Sent 0 and Sent 1 to themselves. 
The self transitions, where a transition is shown leaving a state and returning 
to itself, are used to show the actions which result at this point, as with the need 
to resend the packet after a timeout, or to indicate that certain messages can 
be accepted and are ignored, such as the arrival of a duplicate ack in the states 
Can Send 1 and Can Send 0. 

3 Using UML for Performance Modelling 

In this section we survey existing ideas for exploiting UML designs for perfor- 
mance modelling. These now include both simulation methods and queueing 
network modelling approaches. We start by considering each form of description 
in turn. We then consider an approach which combines several forms of description. 

3.1 Exploiting Implementation Diagrams 

Implementation diagrams, which we have not considered here, provide a final 
mapping of collaborations (components) onto computing and storage devices. 
The overall system can be modelled as an open queueing network. This approach 
was used by Pooley and King to model a simple autoteller machine system. 



3.2 Direct Simulation of Sequence Diagrams 

Several previous papers have identified sequence diagrams as having the potential 
to generate and display useful information relating to performance. 

Smith and Williams have published a number of case studies where 
they show how an object oriented design can be modelled, starting from the 
class diagram and appropriate scenarios expressed as sequence diagrams. 

Timing information can be added by labelling messages with relative timing 
constraints, using nested numbering of messages to describe exact orderings. In 
a prototype simulation tool, Kabajunga and Pooley [8,9] encoded the information 
about time intervals between messages, along with object names, in a driver file. 

Permabase. In a project sponsored by BT, a group at the University of Kent 
at Canterbury have built a system for modelling distributed object systems by 
automatically generating simulation models from UML use case and interaction 
models. 

Pooley and Hargreaves defined and built a Java class library for simulation 
of combined statecharts and collaborations. Numerical solution of these models 
was introduced by King and Pooley using a database example. 

4 Deriving a Petri Net Model of ABP 

The task of performance modelling requires insight and skill. This will always 
be true. We can, however, reduce the number of mechanical tasks needed in 
this process. In particular, we can define and implement mappings from UML 
designs into appropriate performance modelling formalisms. In this section we 
develop a detailed model using Petri nets, exploring the automatic mapping from 
combined collaboration and statechart models. 



4.1 Petri Nets 

Petri nets are a formalism which has been developed to describe the concur- 
rent behaviour of interacting systems. A system is represented graphically us- 
ing places, transitions, arcs, and tokens. Places, graphically represented by 
circles, stand for the sub-states that parts of the system can be in. A token, 
indicated by a spot, is present whenever that part of the system is in the corre- 
sponding sub-state. The place is said to be marked. Transitions are represented 
by solid bars. Transitions have input arcs which link places to transitions, and 
output arcs which link transitions to places. There are no arcs directly from one 
place to another or from one transition to another. 

A transition is enabled if all of the places attached to its input arcs have a 
token present. A transition which is enabled may fire, causing one token to be 
removed from each of its input places, and one token to be added to each of its 
output places. Traditional Petri nets are used to investigate such issues as mutual 
exclusion and freedom from deadlock. Deadlock occurs if the net can ever reach 
a state in which no transition can fire. Mutual exclusion can be demonstrated 
by proving that two places in different components are never simultaneously 
occupied by a token. Peterson [15] is an excellent introductory text on Petri nets 
that shows how to use Petri nets for these problems. 
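
The enabling and firing rule just described can be sketched as follows. The representation (a marking as a dictionary from place names to token counts, a transition as a dictionary of input and output places) is an assumption made for this illustration and is not the input format of any particular tool.

```python
def is_enabled(marking, transition):
    """A transition is enabled if every input place holds at least one token."""
    return all(marking.get(place, 0) >= 1 for place in transition["inputs"])


def fire(marking, transition):
    """Return the marking reached by firing an enabled transition."""
    if not is_enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place in transition["inputs"]:
        new_marking[place] -= 1   # remove one token from each input place
    for place in transition["outputs"]:
        new_marking[place] = new_marking.get(place, 0) + 1  # add one to each output
    return new_marking


def is_deadlocked(marking, transitions):
    """A marking is dead if no transition can fire in it."""
    return not any(is_enabled(marking, t) for t in transitions)


if __name__ == "__main__":
    # Hypothetical two-place net: a single transition moves the token from p1 to p2.
    t = {"inputs": ["p1"], "outputs": ["p2"]}
    m0 = {"p1": 1, "p2": 0}
    m1 = fire(m0, t)
    print(m1, is_deadlocked(m1, [t]))
```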

Traditional Petri nets do not measure time in any way. A number of exten- 
sions have added timing information, sometimes to allow more powerful theo- 
rems to be proved about the set of states that can be reached. The most popular 
timed extension is to specify stochastic Petri nets, in which some transitions fire 
a randomly distributed time after they become enabled. These are usually rep- 
resented by an open bar. If the distribution of time until an enabled transition 
fires is exponential, then the set of states that the Petri net can enter forms a 
Markov process, and its transition matrix can be generated, and performance 
metrics can be calculated from the steady state solution. 
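
As a sketch of the last step, the steady state distribution of a small continuous time Markov chain can be obtained by solving $\pi Q = 0$ together with $\sum_i \pi_i = 1$. The code below is a plain Gaussian elimination written for illustration, assuming an irreducible chain given by its generator matrix; packages such as SPNP use their own solvers, and the two-state example is hypothetical.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible generator matrix Q."""
    n = len(Q)
    # Transpose Q so that the unknowns form a column vector, then replace the
    # last balance equation by the normalisation condition.
    A = [[float(Q[j][i]) for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    pi = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][c] * pi[c] for c in range(i + 1, n))
        pi[i] = (b[i] - tail) / A[i][i]
    return pi


if __name__ == "__main__":
    # Hypothetical two-state chain: rate 2 from state 0 to 1, rate 3 back.
    Q = [[-2.0, 2.0], [3.0, -3.0]]
    pi = steady_state(Q)
    # Throughput of the 0 -> 1 transition is its rate weighted by pi[0].
    print(pi, pi[0] * 2.0)
```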



4.2 Use Case and Workloads 

The actors in use case diagrams should represent all external stimuli to the 
system being modelled. There is considerable disagreement about which external 
systems should be shown here, but at least all initiators of activities within the 
system should be represented. It seems logical, therefore, to use actors as the 
basis for defining workloads in the system. Each actor does not correspond to a 
single person or system, but rather to a (set of) role(s) played by one or more 
people or systems. Thus each actor may represent just one part of the workload 
for one part of a system. 
Similarly, each use case represents one class of requests. In the ABP model, 
we have shown three use cases, which might well represent different packet length 
and arrival rate distributions. 

4.3 A UML Combined State and Collaboration Model 

The diagram in Figure 6 shows the collaboration from our UML model of ABP, 
with the statecharts of all the objects embedded within them. If this is a correct 
model, all possible behaviours of the model are captured. 

Each object can be in any of the states within its statechart. The current 
state of the whole system is represented by the combination of the states of the 
individual objects. The number of possible combinations may be greater than 
the number of combinations which the system can actually evolve into. This will 
depend on the initial combination of states and the behaviours of the objects. 

The scenarios which the system can execute should correspond to the overall 
behaviour of this model and the resulting states form the reachable subset of 
overall states. 
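
The distinction between the full product of object states and the reachable subset can be illustrated by a breadth-first exploration from the initial combination. The two-object example is a deliberately tiny, hypothetical interaction and is not the ABP model of Figure 6.

```python
from collections import deque
from itertools import product


def reachable(initial, successors):
    """Return the set of global states reachable from `initial`."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def toy_successors(state):
    """Hypothetical sender/receiver pair that proceeds in lock step."""
    sender, receiver = state
    if sender == "CanSend" and receiver == "Waiting":
        return [("Sent", "Waiting")]      # sender transmits
    if sender == "Sent" and receiver == "Waiting":
        return [("Sent", "Acking")]       # receiver gets the packet
    if sender == "Sent" and receiver == "Acking":
        return [("CanSend", "Waiting")]   # ack returns to the sender
    return []


if __name__ == "__main__":
    all_combinations = set(product(["CanSend", "Sent"], ["Waiting", "Acking"]))
    found = reachable(("CanSend", "Waiting"), toy_successors)
    print(len(all_combinations), len(found))
```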

4.4 The Corresponding GSPN Model 

Translation of the state diagram representing each object into a corresponding 
stochastic Petri net is straightforward. Each state in the state diagram is rep- 
resented by a place in the Petri net. A token is present in a place if the state 
machine is in the corresponding state. The transitions in the state machine are 
also mapped into transitions in the Petri net. The input place to the Petri net 
transition corresponds to the state at the start of the state chart transition, 
and the output place from the Petri net transition represents the target state 
of the statechart transition. In simple, single threaded objects, these Petri net 
transitions all have one input and one output place. Statecharts representing 
objects with internal concurrency can also be represented without difficulty. It 
is possible that a single transition in the state machine can be represented by 
several different transitions in the Petri net. Such transitions all have the same 
input and output place, and are known as competing transitions. When the state 
chart is considered in isolation from the other state charts in the model, which 
of the competing transitions fires is unimportant. Transitions in the state chart 
may be internal, with no interaction with other objects; they may be timed, 
corresponding to the use of the after (t) construct; or they may be associated 
with messages received from or sent to other objects. 
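
The translation rule for a single-threaded object can be sketched as below. The tuple layout for statechart transitions and the transition names other than those appearing in the figures are assumptions made for this illustration; the mapping of nested or concurrent states is not covered.

```python
def statechart_to_petri_net(states, transitions, initial):
    """Map each state to a place and each statechart transition to a net transition.

    `transitions` is a list of (name, source, trigger, target) tuples.
    """
    places = list(states)
    net_transitions = [
        {
            "name": name,
            "inputs": [source],    # state at the start of the transition
            "outputs": [target],   # target state of the transition
            "timed": trigger.startswith("after"),  # after(t) becomes a timed transition
        }
        for (name, source, trigger, target) in transitions
    ]
    # Exactly one token, placed in the place for the initial state.
    marking = {s: (1 if s == initial else 0) for s in states}
    return places, net_transitions, marking


if __name__ == "__main__":
    sender_states = ["CanSend0", "Sent0", "CanSend1", "Sent1"]
    sender_transitions = [
        ("InitSend0", "CanSend0", "send", "Sent0"),
        ("TimeOut0", "Sent0", "after(t)", "Sent0"),      # self loop
        ("Ack0", "Sent0", "deliver(ack, 0)", "CanSend1"),  # hypothetical name
        ("InitSend1", "CanSend1", "send", "Sent1"),
        ("TimeOut1", "Sent1", "after(t)", "Sent1"),      # self loop
        ("Ack1", "Sent1", "deliver(ack, 1)", "CanSend0"),  # hypothetical name
    ]
    net = statechart_to_petri_net(sender_states, sender_transitions, "CanSend0")
    print(net)
```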

Consider the sender process in the alternating bit protocol; it has four states, 
two corresponding to sequence number 0 for the packet and two corresponding to 
sequence number 1. It cycles between CanSend0, Sent0, CanSend1, and Sent1, 
in that order. It moves from $\mathit{CanSend}_i$ to $\mathit{Sent}_i$ when a new packet becomes 
available, and is transmitted on the forwards channel. Likewise, when the 
acknowledgment of packet $i$ arrives, the state of the sender process changes from 
$\mathit{Sent}_i$ to $\mathit{CanSend}_{i+1}$. When the sender is in state $\mathit{Sent}_i$ it periodically times out 
and sends a further copy of the packet with sequence number $i$. This transition is 
a self loop and does not change the state of the process. The after(t) labelling 
on the transition in the state diagram is captured by using a timed transition 
in the Petri net. The Petri net corresponding to this state diagram is shown in 
Figure 7. The transitions corresponding to the receipt of an acknowledgment and 
to the occurrence of a time out and its subsequent retransmission are competing. 

Fig. 6. The alternating bit protocol modelled as a UML collaboration with 
embedded statecharts 

Likewise, the Petri net corresponding to the state diagram for the Receiver 
process is a direct translation which can be seen in Figure 8. The channel will 
deliver packets. When the receiver is in state Waitfor0, a packet with sequence 
number 1 will cause a transition to the same state, a self loop. When sequence 
number 0 arrives, the receiver moves to state Waitfor1, and subsequent 
receptions of sequence number 0 cause transitions which do not change state. Each 
transition also causes the transmission of an acknowledgment with the same 
sequence number as the packet. 

The channels are modelled using the Petri net shown in Figure 9, one net for 
each direction of transmission. The channel is initially idle, and makes the 
transition to Sent when a packet is transmitted. A timed transition corresponds to 
the after(d) guard on the transition to the SucceedorFail state, where probabilities 
are attached to the transitions OK and Loss to represent the probability of 
transmission error. If an error occurs, the Loss transition fires and the channel 
becomes idle again. If no error occurs, then the channel enters the Arrived 
state, until the packet can be delivered, when it reverts to the idle state again. 
The Petri net shown for the channel is a slight simplification, since separate Sent, 
SucceedorFail and Arrived places are needed to represent the two possibilities 
for sequence numbers. 

Fig. 7. Petri net corresponding to Sender 

Fig. 8. Petri net corresponding to Receiver 

It is noticeable that each Petri net has a straightforward invariant: the number 
of marked places is always exactly 1. This phenomenon occurs because 
there is no concurrency within the UML objects. With this observation, we can 
combine the Petri nets corresponding to the various state diagrams by identifying 
transitions in the separate Petri nets with one another. The identification is 
made on the basis of the messages transmitted or received in the corresponding 
statechart. In particular, the Accept transition of the forward channel is identified 
with the $\mathit{InitSend}_i$ and $\mathit{TimeOut}_i$ transitions in Sender. The Idle place in the 
channel becomes an input place to those transitions, and the $\mathit{Sent}_i$ place becomes 
an output place. Likewise, the Deliver transition in the channel is identified with 
the Received and Duplicated transitions in the Receiver. For the reverse channel, 
which carries the acknowledgments, the Accept transition in the Petri net is 
identified with the Received and Duplicated transitions in Receiver. Figure 10 shows 
the part of the complete Petri net model corresponding to the transmission and 
reception of packets with sequence number 0. The arcs and places relating to the 
reverse channel carrying acknowledgments and to the forward channel carrying 
packets with sequence number 1 are not shown. Note that the Idle place for the 
forward channel is not duplicated, only the places corresponding to the presence 
of a packet. 

Fig. 9. Petri net corresponding to Channel 




Fig. 10. Combined Petri net (partial) 
5 Numerical Results 

In order to confirm the correctness of the model that we have constructed, we 
have compared our numerical results with those calculated by Hermanns et al. [6] 
using their TIPP system. This is a stochastic process algebra which supports 
numerical solution. 

Our stochastic Petri net solution was encoded using Trivedi’s SPNP pack- 
age, and solutions evaluated for differing arrival rates, time out intervals, and 
probabilities of transmission error. No attempt was made to model the effect of 
the non exponential nature of time outs or packet transmissions.¹ All our 
experiments used a mean transmission rate of 10 packets per second. This means that 
the maximum throughput that can be expected of the system is 5 packets per 
second, since acknowledgments take similar transmission time to packets. The 
probability of a transmission error was chosen to be either 0.01, 0.05 or 0.1, and 
the mean time out interval ranged from 0.02 seconds to 2.5 seconds. 
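
The bound of 5 packets per second follows from adding the mean times of the two transmissions needed per packet. A minimal sketch of the arithmetic, assuming (as stated above) that a packet and its acknowledgment are sent one after the other at the same mean rate, and ignoring losses and timeouts:

```python
def max_throughput(packet_rate, ack_rate):
    """Packets per second when each packet needs a data and an ack transmission in turn."""
    mean_cycle_time = 1.0 / packet_rate + 1.0 / ack_rate
    return 1.0 / mean_cycle_time


if __name__ == "__main__":
    # 10 per second in each direction gives a mean cycle of 0.2 seconds.
    print(max_throughput(10.0, 10.0))
```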

Maximum throughput when the transmission error probability is 0.01 is 4.37 
packets per second, and occurs when the time out interval is 0.5 seconds. When 
the error probability is 0.1, the maximum throughput obtained is 3.37 packets 
per second, and occurs when the time out interval is 0.1 seconds. 

The graph in Figure 11 shows the maximum throughput as a function of the 
time out interval. It can be seen that when the time out interval is very long, the 
throughput is low, because any packets which are lost because of transmission 
errors have a long wait before they are retransmitted. As the interval reduces, the 
throughput increases. A maximum throughput is reached, and then the through- 
put decreases slightly, because time outs are occurring too frequently, even when 
the packet has not been lost, and the retransmission of correctly received packets 
is impeding the use of the channel by the subsequent packet. 



6 Current Work 

The example used is perhaps the simplest that could be used with any appeal to 
usefulness. There remain a number of tasks that might broaden the usefulness of 
the approach. The most significant is to automate the mapping from the UML 
description to the performance model. This involves both the exploration of a 
wider variety of models and consideration of alternatives to GSPNs as a target. 

Some early attempts have allowed a cautious optimism that a wider concept 
of state, shown in UML by inclusion of conditional behaviour based on variables, 
and concurrency within objects, also allowed in Statecharts, can be accommo- 
dated. The problems are apparently those inherent in GSPNs themselves. 

At the same time, it appears that both alternative mappings, for instance 
onto stochastic process algebras (SPAs), and direct generation of the underlying 
continuous time Markov chain (CTMC) without an intermediate representation 
have possibilities. Very early results in both of these have been reported by us, 
while there have been recent developments in solving collaboration diagram 
models using iteratively solved queueing networks [13]. 

¹ The TIPP figures simulated a constant timeout interval by using an Erlang-k 
distribution for the time out. There is no reason other than coding convenience that 
such a technique was not used in our system. 

Fig. 11. Maximum Throughput as a Function of Time Out Interval 

7 Conclusions 

We have demonstrated that, for suitably constrained models, UML specifications 
of interacting software processes can be transformed into stochastic Petri nets. 
Numerical evaluation of these Petri nets can give estimates of the performance 
of the software. This provides a straightforward link between an existing popular 
design notation, and performance models of the system being designed. Further 
work is being undertaken to automate the translation from UML state diagram 
to Petri net, to explore alternatives to Petri nets, to understand alternatives 
developed elsewhere and to build performance evaluation capability into a CASE 
tool. The ultimate aim is to produce a tool in which performance problems can be 
identified early in the design cycle and tracked throughout software development. 
Abstract. Video-on-demand (VoD) applications require architectures 
which are scalable over a wide range of capacity. Two of the most 
promising architectures for VoD are massively parallel symmetrical 
computers based on distributed memory (MPP) and clusters of 
workstations interconnected by a high-performance network. The work 
described here investigates the scalability of a prototype cluster 
architecture and compares it with a commercially available hypercube 
MPP system. Measurement experiments using emulated workloads 
were carried out for increasing loads, and the effects on capacity of 
encoding-mode and user interactions were explored. Static analysis, 
supported by simulation results, was used to make projections over a 
range of configurations and technology assumptions. Sample results are 
presented which chart scalability on two hierarchic levels. 



1 Introduction 

Scalability is a vital consideration when entering new application areas involving 
potentially large numbers of users. Video-on-demand (VoD) applications require 
architectures which are scalable over a wide range of capacity [10]. This paper arises 
from a collaboration between telecommunications companies (see Acknowledgement) 
which explored scalability via case studies for two types of server architecture: MPP 
hypercube and Workstation cluster, in the context of rapidly changing technology. 

In a typical system (Fig. 1), the VoD server communicates with a large number of 
clients over a broadband communications network. Users issue commands to the 
server via some form of keypad and receive audio and video information via some 
form of display. Such systems have a variety of applications in training, 
entertainment, etc., with greatly varying workload characteristics. 

The investigation used a combination of measurement and modelling techniques: 
benchmark experiments, static analysis, simulation. This may be contrasted with more 
analytical work on scalability reported elsewhere, e.g. [3]. The projected effect of 
technological change was built in to the analysis. The same methodological approach 
was used for both architectures. This paper however focusses on the workstation 
cluster case. 



[Figure: the video server sends data to the clients over an ATM network; the 
clients send commands back to the server.] 

Fig. 1. Video-on-demand schematic 



1.1 Scalability and System Requirements 

Scalability is a property not of an individual system, but of an architecture which can 
provide a range of systems. Individual systems must satisfy a specific set of 
requirements, both functional and non-functional. The trade-offs made between 
different requirements can vary from system to system. Qualitatively and intuitively, 
scalability may be regarded as the ability within an architecture to configure systems 
which satisfy widely varying capacity requirements under constraints expressed by 
the other non-functional requirements. Requirements in the VoD context typically are: 

• function (in terms of the communication interface): data format (Motion JPEG, 
MPEG-1, MPEG-2, ...); bit rate encoding (variable or constant); nominal bit rate. 

• capacity: number of users (streams) which can be served, and number of different 
titles which can be accessed. Number of users has implications both for the 
connectivity of the system, and for the peak load levels. The peak streaming 
requirement for different video titles is also an important characteristic. Together 
these requirements give rise to an on-line storage requirement. There are thus 
several types of capacity requirement which vary with the application. 

• cost: ...of the overall configuration. The future cost of both software and hardware 
is hard to project. 

• quality of service: experienced by the end user as image and sound quality. There 
is a direct correspondence between the amount of information lost in accessing and 
transmitting the film and the perceived quality. Thus end-user quality is quantified 
by the frame loss rate measured at each client.¹ 

• reliability: the fraction of time the system is able to supply the specified number of 
streams. 

¹ Picture quality can also be affected by encoding method. 
System boundaries for the scalability study must also be defined. We took the 
interface between the VoD server and the communications network as the boundary. 



1.2 Experimental Approach 

The general approach was to conduct measurements on concrete examples of each 
architecture, and to extend the results by appropriate modelling and analysis. The 
experimental work took place over a period of more than 12 months, during which 
both the availability and state of development of the systems under test were changing. 

The work proceeded in two logical steps. Step one was exploratory. Measurement 
and test environments were developed and we investigated how different workload 
factors affected performance. Step two provided device utilisations and quality of 
service limits for use in configuration analysis. For the cluster architecture step two 
could not be completed until a multi-thread prototype (Base2) had been developed. 

The configuration analysis addresses streaming capacity, that is, the maximum 
number of parallel video streams that could be supported at the required quality of 
service. The basic data was acquired by measurement using a controlled workload. 
Projections for other configurations and for technology developments were then 
obtained via static analysis and simulation, using the measured system as a baseline. 

In the case of MPP technology this approach had to be modified. The architecture 
first analysed (nCUBE2) became superseded by nCUBE3. We therefore based the 
technology projections on simulation of nCUBE3 (by Telefonica) supported by the 
results of vendor benchmark measurements. This paper focusses on the cluster 
architecture, with some MPP data provided for comparison purposes. 



2 Architecture 

VoD servers have to provide a high degree of parallelism and support a high I/O rate 
with real-time constraints. They also need to be scalable. The project compared 
experimentally and analysed two examples drawn from alternative architectural 
approaches to these requirements: 

• Massively parallel architectures (MPP) have been used successfully for VoD. The 
example chosen was a commercially available system with the Oracle Media 
Server DBMS on nCUBE. This architecture has many processing nodes arranged 
in a hypercube. A hypercube of dimension n connects each processing node with n 
neighbours, giving a regular closed structure with 2^n nodes. In nCUBE2, the I/O is 
driven by separate sets of nodes connected to the hypercube by fast channels. In 
nCUBE3, each hypercube node has its own local disk storage and network 
interface, providing a more homogeneous structure. 

• Cluster architectures. This approach is characterised by independent workstations 
grouped in a cluster which communicate via a high-speed local network. A 
prototype example is the Elvira DBMS system running on multiple Sun 
workstations [4, 5]. The scalability of this particular cluster architecture is the 
focus of this paper. The Elvira software was developed by Olav Sandstå at NTNU 
and Øystein Torbjørnsen at Telenor R&D, Trondheim, Norway. 
2.1 The Cluster Architecture 

The following description by Torbjørnsen is drawn from the project documentation. 
The video server (Fig.2) consists of multiple Unix workstations where one is a 
command server and the rest video pumps. A remote client computer (video player) 
connects to the command server and submits commands. If a video sequence is to be 
delivered, the command server allocates delivery capacity on one of the video pumps 
storing the actual video film. The video pump connects to the client and starts to 
stream the video. The command interfaces are bi-directional TCP/IP connections. The 
delivery stream channel is either a uni-directional UDP/IP or ATM AAL5 connection. 




Fig. 2. Video Server - cluster architecture 



The server can be scaled by adding more video pumps. When a video sequence is 
loaded onto the system, it is manually stored on one pump and its entry recorded at 
the command server. The video can be replicated on several pumps. The command 
server is responsible for assigning clients to pumps with spare capacity. 

The video pump is implemented as a single multi-threaded process, consisting of: 

• One command thread handling messages from the command server. 

• One delivery thread for each stream being delivered to clients. 

• Two disk threads for each disk device storing data to be delivered. 



3 Scalability 

3.1 Scaling Functions 

The term scalability refers to the ability of an architecture to meet changing 
requirements for system capacity by changes in the system size. The quantitative 
relationship between these two properties can be represented by a scaling function. 
Size refers to the amounts of the various physical resources of the system for 
computation, data storage and communications (internal and external). A scaling 
function provides three kinds of information: shape, step-size and limits (Fig. 3a). 

If other requirements such as quality-of-service are fixed for a set of systems, then 
scalability can be related to cost. Firstly, the step-size in system resources which it is 
possible to configure, determines how smoothly capacity can be matched to changing 
requirements. System over-capacity or under-capacity adversely affects unit costs. 
Secondly, the shape of the scaling function determines the economy of scale which 
can be achieved. Irrespective of cost, the available technology may not permit a 
device or node connectivity greater than (or even less than) some limit. Replication of 
the whole system may help, provided any central coordination which is required does 
not become a limiting factor. 

Each type of capacity requirement has its own scaling function. We shall focus on 
streaming capacity, but a complete analysis must also consider online storage 
capacity. The dependency between these requirements is application-specific. 



3.2 Hierarchical Scaling 

In a parallel or distributed architecture the packaging of system resources has a crucial 
bearing on scalability. Size will often be expressed hierarchically in terms of numbers 
of replicated subsystems or nodes, each of a defined capacity. Thus it is necessary to 
consider two levels of scaling function: 

• Node-level. Here size cannot be readily expressed as a single parameter. There is 
a parameter pair (device capacity, number of devices) corresponding to each 
device type. In Section 7, the node scaling function for the cluster architecture is 
determined by a configuration analysis and partially tabulated in Table 7. 

• System-level. Here, for a homogeneous design, size can be expressed as the pair 
(node capacity, number of nodes). In our case both architectures are designed to 
be linearly scalable with respect to the number of nodes. The relationship between 
system capacity, node capacity and number of nodes is represented in Fig. 3(b). 
An increment of 60 streams/node is a convenient common denominator for the 
two architectures. It corresponds to the full utilisation of a single ATM interface 
card. 




[Figure: two plots with size on the horizontal axis.] 

Fig. 3(a) General scaling function showing shape, step-size and limits 

Fig. 3(b) Two system-level scaling functions 
4 Workload 



Workload Characterisation 

VoD applications range over a wide spectrum. Some key characteristics are: 

• the size and play-length distribution of the title-set 

• access pattern to different titles 

• picture quality requirement 

• mode of usage: interactive requirements 

• load requirement: number of users on-line, possibly of different types 

• dynamic variations in load, usage patterns and content requirements 

• system management: downloading of new video films, etc. 

Mode of usage and load requirement were varied in the measurement experiments, 
while picture quality and play-length were held constant. Size and access pattern were 
considered theoretically in relation to online storage requirements. Dynamic 
variations and system management aspects were not given detailed consideration. 



Workload Model 

Executable scripts were used as a common basis of comparison for the two 
architectures. The interaction of users with the systems was emulated by a state 
machine. The states are the modes in which the video pump runs: “play”, “ff” (fast 
forward), “fb” (rewind), “pause” and “stop”. Each state is associated with a dwell 
time, drawn from a uniform random distribution. After this time a new command is 
selected probabilistically, causing a transition to the next state. Scripts for low, 
medium and high levels of interaction were defined using appropriate dwell times and 
transition probabilities. Fig. 4 shows the values for the high interaction script. 
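
The script generation described above can be sketched as follows. This is a minimal illustration in Python, not the tool used in the project: the dwell-time ranges and transition probabilities are hypothetical placeholders (the values of Fig. 4 are not reproduced), the names DWELL_RANGE_S, TRANSITIONS and generate_script are our own, and for simplicity a script is assumed to end when the accumulated dwell time reaches the film length. 

```python
import random

# Hypothetical dwell-time ranges in seconds (uniform distribution per state).
DWELL_RANGE_S = {
    "play": (30.0, 120.0),
    "ff": (2.0, 10.0),
    "fb": (2.0, 10.0),
    "pause": (5.0, 30.0),
}

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "play": {"ff": 0.3, "fb": 0.2, "pause": 0.4, "stop": 0.1},
    "ff": {"play": 0.8, "fb": 0.1, "pause": 0.1},
    "fb": {"play": 0.8, "ff": 0.1, "pause": 0.1},
    "pause": {"play": 0.9, "stop": 0.1},
}


def generate_script(film_length_s, seed):
    """Return a list of (time_s, command) pairs for one emulated client."""
    rng = random.Random(seed)  # the seed makes the script reproducible
    clock, state = 0.0, "play"
    script = [(clock, state)]
    while state != "stop":
        low, high = DWELL_RANGE_S[state]
        clock = min(clock + rng.uniform(low, high), film_length_s)
        if clock >= film_length_s:
            state = "stop"  # simplification: the session ends with the film
        else:
            options = TRANSITIONS[state]
            state = rng.choices(list(options), weights=list(options.values()))[0]
        script.append((clock, state))
    return script
```

Because the generator is seeded, the same (film length, seed) pair always yields the same command sequence, which is what allows one script to be replayed against both architectures. 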



[Figure: state-transition diagram with the labels “Movie running” and “Last image fixed”.] 

Fig. 4. State transitions for the high interaction script 



Scripts were set with the length of the video film and a random number seed. Video 
lengths were 60 and 30 min. for low and high interaction scripts respectively. The 
Oracle system used constant bit-rate encoding whereas Elvira used variable bit-rate. 
5 Technology 

5.1 The Forecasting Problem 

The two architectural approaches were at different stages of development when 
the project began in 1996. The MPP approach was represented by a commercially 
available system, nCUBE2, whereas the Cluster approach was represented by a 
research prototype (Elvira). These systems provided the basis for the investigations 
done in the early and middle phases of the project. We wished to explore the impact 
of technology development under two sets of assumptions corresponding to: 

1. Technology commercially available at the analysis stage of the project (Q1 1998). 
2. Technology likely to be available at a similar cost within the foreseeable future 
(2001). 

We restricted ourselves to developments already announced within the industry. The 
forecasting problem differed somewhat for each architecture. For MPP we were 
dealing with commercially available complete systems; we took the MediaCUBE 
nCUBE3 architecture as the 1998 reference point, and made a prediction for 2001 
based on current hardware trends. For the Cluster architecture we were concerned 
with off-the-shelf availability of hardware. We based the 1998 reference point on 
current products towards the low end of the cost spectrum and assumed that today’s 
high-end products would be driven down to normal price levels within the prognosis 
period. This approach enables us to deal largely with known performance 
characteristics, albeit with assumed price levels. 

Improvements in hardware technology, both in terms of raw performance and 
price-performance, do not always occur at the same rate. For example, CPU power 
and memory economics may outstrip improvements on the I/O side, leading to 
alterations in fundamental design trade-offs. The implications for analysis may be 
classified as: 

• incremental change whose effect can be predicted by routine modification analysis 

• radical change involving significant architectural or software differences which 
need special investigation. 

Radical changes, e.g. huge increases in RAM size, have an impact that is more 
difficult to quantify. We left such changes out of scope, except where there were 
already clear announcements, as with UPA and MediaCUBE interconnection architectures. 



5.2 Technology Reference Points² 

Communication Rate 

This factor is common for the two architectures. Table 1 shows expected ATM 
characteristics for the baseline systems and the two projections. The leap in ATM 
bandwidth between 1998 and 2001 will create higher expectations with regard to 
picture quality, so we also assume a doubling of the video transmission rate required. 



² As prepared by the first quarter of 1998. The 2001 technology forecast may well be conservative. 
Table 1. Technology reference points (communications) 

Common factors      Quantity                 Base1, Base2   Projection (1)   Projection (2) 
                                                            1998             2001 
video stream        transmission rate Mb/s   2              2                4 
ATM capacity        speed Mb/s               155            155              622 
(block size 47KB)   useable bandwidth Mb/s   120            120              480 


Cluster Architecture 

The reference systems chosen are presented in Table 2. Base1 was used for the 
so-called micro-benchmarks. These provided single-thread measurements which guided 
early projections [1]. Later a multi-thread prototype became available on the Base2 
machine, which provided a more satisfactory baseline for further projections. 



Table 2. Technology reference points (cluster architecture) 

Cluster                            Base 1              Base 2            Projection 1      Projection 2 
architecture                                                             (1998)            (2001) 
                                   Sparc20 clone       UltraSparc1       UltraSparc1       UltraSparc3 
                                   125 MHz             167 MHz           167 MHz           600 MHz 
CPU              Specint95         3.31                6.4               6.4               35 
Memory           Size GB           0.128               0.5               0.5               1 
Memory bus       Type              Mbus                UPA               UPA               UPA 
                 Speed MB/s        200                 1300              1300              2400 
I/O bus          Type              Sbus                Sbus              Sbus              PCI 
                 Speed MB/s        100                 123               133               528 
Device channel   Type              SCSI                SCSI              SCSI              FiberChannel 
                 Speed MB/s        10                  10                20                100 
Disk             Type              Seagate Barracuda   Micropolis 3243   Seagate Cheetah   Prognosis³ 
                 Size GB           4                   4.3               4.55/9.1          36 
                 Speed MB/s: 
                   Range           6-9                 5.75-10           11.3-16.8 
                   Mid-point       7.5                 7.9               14.0              22.8 
                 Av. seek ms       8                   8.9               7.5               5.2 
                 Latency ms                            4.17              2.99              2.99 



³ Assumes 60% per yr. improvement in packing density and transfer rate proportional to square 
root of density. Seek time and latency reflect current “best of breed” (over-conservative). 

6 Measurement Study 

The system measured was a dual processor version of the Base 2 technology reference 
point (Table 2). The workload was emulated on a separate work-station. Emulated 
clients (users) are controlled by two scripts, generated off-line. The Driver Script 
determines when each client should be invoked, and Client Scripts control the 
sequence of commands as described in section 4.1. 

The Video Pump delivers a small number of streams to a Viewer. Most of the 
streams are thrown away. Text-based information (timestamp, show time, movie time, 
container type, and sequence numbers) about selected streams is recorded on a Log 
File. The output is processed by a Perl script which calculates frame error rates, error 
bursts, interarrival times between each complete video frame and response times for 
the individual client commands. 

Disk utilisation was measured with the UNIX command iostat. Measuring CPU 
usage with iostat is not reliable, so the system call getrusage was used. Although we 
measured a dual processor, the utilisation reported is the single processor equivalent. 




Fig. 5. Device utilisations and frame-loss in ‘play’ mode 



6.1 Streaming Capacity in Play Mode 

The results with a single disk, two disk threads and a single “play” command are 
plotted in Fig.5. We observed the device utilisations and frame loss rate for an 
increasing number of streams. Loss rate varied from stream to stream. Only the 
maximum loss rate is plotted. The sharp "knee" in the curve for maximum loss rate 
indicates the point at which full streaming capacity is reached. 

Device utilisation increases linearly over the middle range, the constant slope of 
the curves giving the device utilisation per stream. Measurements for other scripts 
follow a similar pattern. 







P.H. Hughes and G. Brataas 



6.2 Effect of User Interaction on Streaming Capacity 

Table 3 below shows the number of streams supportable with an acceptable frame 
loss rate for different scenarios. The mean utilisation is for the complete script, while 
the max-mean utilisation is the highest moving-average over 5 seconds. All scenarios 
except the last used a single disk. 
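
The two utilisation figures can be computed from a series of utilisation samples as sketched below. This is a minimal Python illustration under stated assumptions: the function name mean_and_max_mean and the sample values are hypothetical, and the samples are assumed to be taken at a fixed interval (with one sample per second, a window of 5 samples would correspond to the 5-second moving average used here). 

```python
def mean_and_max_mean(samples, window):
    """Return (mean, max-mean) of utilisation samples given in percent.

    The max-mean is the highest moving average over `window` consecutive samples.
    """
    if not 0 < window <= len(samples):
        raise ValueError("window must be between 1 and the number of samples")
    mean = sum(samples) / len(samples)
    window_sum = sum(samples[:window])
    best_sum = window_sum
    for i in range(window, len(samples)):
        # Slide the window one sample to the right.
        window_sum += samples[i] - samples[i - window]
        best_sum = max(best_sum, window_sum)
    return mean, best_sum / window


# Hypothetical disk utilisation samples (%), with a window of 2 samples.
mean_util, max_mean_util = mean_and_max_mean([80, 90, 100, 90, 80, 70], 2)
```

With these illustrative samples the mean is 85 % and the max-mean is 95 %; the gap between the two reflects how bursty the load is. 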



Table 3. Behaviour at quality limit 

Script             # streams   Loss rate, first    Bottleneck utilisation 
                               (last) stream       Mean            max-mean 
“play”, 1 disk     13          0.03 % (0.31 %)     Disk (85.6 %)   Disk (95 %) 
“ff/fb”            15          0.00 % (0.16 %)     Disk (80.7 %)   Disk (89 %) 
high interaction   15          0.04 % (0.77 %)     Disk (81.1 %)   Disk (95 %) 
“play”, 3 disks    28          0.18 % (0.51 %)     SCSI (70 %)     SCSI (81 %) 


From Table 3 the number of streams supportable with the high interaction script (15) 
is higher than with only the “play” command (13). The high interaction script has on 
average a shorter dwell time in the “play” state, so the probability of all streams being 
in the “play” state at the same time is lower, and the peaks in disk utilisation (max- 
mean column) are fewer and shorter. Medium and low interaction scripts exhibited 
intermediate behaviour. Thus “play” mode provides a conservative estimate of 
streaming capacity. 



6.3 Effect of Variable Bit Rate 

The movie used for the Cluster architecture was encoded with Motion JPEG coding 
and variable bit-rate. The average bit-rate was 2 036 350 bits/s over 70 minutes. The 
highest 1-minute moving average was 2.6 Mbits/s (Fig. 6). During 1 second a peak 
bit-rate of 3.7 Mbits/s was measured. 

The bit rate for a video film with constant bit-rate encoding should equal the mean 
bit rate for the same film with variable bit-rate encoding. The aggregate bit-rates are 
reflected in the CPU and disk utilisations. The ratio of mean to max-mean utilisations 
for “play” (Table 3) implies that 14 streams could be supported in the ‘constant’ case 
in contrast to 13 for ‘variable’. Thus a comparison of variable rate capacity (Cluster) 
with constant rate capacity (MPP) will be on the pessimistic side. 
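
One way to read the estimate of 14 streams is sketched below in Python. The assumption, which is ours, is that with constant bit-rate encoding the bottleneck utilisation would stay at its mean, so the number of streams can grow by the ratio of max-mean to mean utilisation before the same peak level is reached. The function name constant_rate_streams is hypothetical. 

```python
import math


def constant_rate_streams(variable_rate_streams, mean_util, max_mean_util):
    """Estimate streams supportable if the load were flat at its mean level."""
    return math.floor(variable_rate_streams * max_mean_util / mean_util)


# Table 3, "play" with 1 disk: 13 streams, disk mean 85.6 %, max-mean 95 %.
estimate = constant_rate_streams(13, 85.6, 95.0)
```

With the Table 3 values this gives 13 × 95 / 85.6 ≈ 14.4, i.e. 14 whole streams. 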



7 Node-Level Analysis (Cluster) 

7.1 Structural Analysis 



A structural model of the hardware and software (Fig. 7) was built using the Sp 
method [6, 7]. The shaded boxes are the subsystems whose resource utilisation is to 
be analysed. The white boxes represent the logic which invokes these resources via 
the code of the disk and delivery threads. The model was used to analyse the resource 
utilisation of Base1 using the micro-benchmarks and helped to ensure that all 
potential resource bottlenecks were examined. Within the bus architecture only the 
SCSI bus was sufficiently busy to warrant inclusion in the subsequent projections.⁴ 





Fig. 7. Sp decomposition of video pumps 



7.2 Device-Level Projections 

The projection method consists of three steps at the device level followed by a 
configuration analysis step at the node level. Steps (i) - (iii) are set out in Tables 4-6. 
Results in bold are input to the succeeding step. 



(i) device utilisations per stream (baseline) 

Base2 capacity was measured relatively late in the project. Earlier, multithread 
capacity had been projected from the single thread measurements on Base1, assuming 
an estimated 20% increase in resource utilisation. The projections (Table 4) showed a 
good correspondence, which increased confidence in the method. 

⁴ The Mbus of Base1 was replaced in Base2 by UPA, which is much faster (see Table 2). 



(ii) apply technology projections 

The projections (Table 5) are based on the measured data for Base2, where available, 
modified by appropriate technology factors, derived from Tables 1 and 2. For the 
ATM interface, estimates derived from specifications have been used. 
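
The arithmetic behind steps (i) and (ii) can be sketched as follows: a per-stream utilisation on a baseline device is divided by the speed factor of the faster device. This Python sketch uses figures from Tables 2, 4 and 5; the function name project_utilisation is our own, and the observation that the CPU speed factor matches the SPECint95 ratio of Table 2 is our reading of the tables, not a statement from the project documentation. 

```python
def project_utilisation(baseline_percent_busy, speed_factor):
    """Scale a per-stream utilisation (%) to a device `speed_factor` times faster."""
    if speed_factor <= 0:
        raise ValueError("speed factor must be positive")
    return baseline_percent_busy / speed_factor


# CPU: Base1 -> Base2. The ratio of SPECint95 ratings (Table 2) is about 1.93.
cpu_factor_base1_to_base2 = 6.4 / 3.31
base2_cpu = project_utilisation(3.50, cpu_factor_base1_to_base2)  # about 1.81

# Base2 (measured) -> Projection 1, using the speed factors of Table 5.
proj1_disk = project_utilisation(6.68, 1.44)  # about 4.64
proj1_scsi = project_utilisation(2.89, 2.00)  # about 1.45
```

The projected Base2 CPU figure of about 1.81 % per stream can then be compared with the measured 2.05 % in Table 4. 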



Table 4. Baseline analysis for cluster 

System                                           disk    CPU     SCSI    ATM i/f 
Base1 - single thread    %busy per stream                2.90 
  multi thread (estimate) %busy per stream       6.70    3.50 
Base2 - multi thread     speed factor: Base1     1.00    1.93 
  Projected              %busy per stream        6.70    1.81 
  Measured               %busy per stream        6.68    2.05    2.89    - 



Table 5. Projections for cluster 

System                                  disk    CPU     SCSI     ATM i/f 
Projection 1    speed factor: Base2     1.44    1.00    2.00     1.00 
                %busy per stream        4.64    2.05    1.45     1.70 
Projection 2    speed factor: Base2     2.05    5.47    10.00    4.00 
  2Mb/s         %busy per stream        3.26    0.37    0.29     0.43 
  4Mb/s         %busy per stream        7.52    0.74    0.58     0.86 



(iii) device streaming capacity 

Data loss due to queueing delays increases with device utilisation. The level at which 
a device can be driven for a given system and workload before this loss becomes 
unacceptable is termed the Quality of Service (QoS) limit. Device streaming capacity 
is found by dividing the QoS limit by the utilisation per stream as shown in Table 6. 
The QoS limit may be estimated in several ways. For Base2, measurement and 
simulation figures (mutually consistent) were available for disk, and measurement 
data was also available for SCSI.⁵ The CPU figure is softest, being based upon rule of 
thumb. For the ATM interface, utilisations close to 100% can reportedly be achieved, 
since cells are being sent at a constant rate for each data stream. Table 6 provides the 
basic device-level results from which the streaming capacity of different node 
configurations can be analysed. 
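
As a worked illustration of step (iii), the sketch below divides each QoS limit by the corresponding utilisation per stream for Projection 1 (Table 5). It is a minimal Python sketch: the dictionary and function names are our own, and rounding to the nearest integer is an assumption that happens to reproduce the Projection 1 row of Table 6. 

```python
# QoS limits (%) from Table 6 and Projection 1 utilisations per stream (%) from Table 5.
QOS_LIMIT = {"disk": 90, "CPU": 75, "SCSI": 81, "ATM i/f": 100}
PROJ1_UTIL_PER_STREAM = {"disk": 4.64, "CPU": 2.05, "SCSI": 1.45, "ATM i/f": 1.70}


def device_streaming_capacity(qos_limit, util_per_stream):
    """Streams per device: QoS limit divided by utilisation per stream."""
    return {
        device: round(qos_limit[device] / util)
        for device, util in util_per_stream.items()
    }


proj1_capacity = device_streaming_capacity(QOS_LIMIT, PROJ1_UTIL_PER_STREAM)
```

For Projection 1 this yields 19 streams per disk, 37 per CPU, 56 per SCSI bus and 59 per ATM interface. 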



⁵ Strictly, QoS limits vary with the relative loading of devices. They should be slightly different 
for each line of the table. Given all the approximation involved, this effect has been ignored. 

7.3 Projections for Node-Level Scalability 

The cluster architecture permits a variable configuration of devices at each node. The 
UltraSPARC architecture permits a limited number of tightly-coupled multi- 
processors and a variety of I/O devices. 

Table 7 provides a guide to node capacity based on the number of streams 
supportable in “play” mode for different node configurations. In the first column, the 
number of streams given is the estimated capacity of the bottleneck device, obtained 
directly from Table 6. Where available, the corresponding simulation result is given 
in parentheses. These were provided by Telefonica, using an in-house simulator. 



Table 6. Streaming capacity per device (play mode) for cluster 

                          disk   CPU    SCSI   ATM i/f 
QoS limit (%)             90     75     81     100 
Streams - Base1           13     21     27     59 
Streams - Base2           13     37     28     59 
Streams - Proj1           19     37     56     59 
Streams - Proj2 - 4Mb/s   14     100    40     117 



Each major band of Table 7 deals with a different reference technology and begins 
with the minimum configuration (1, 1, 1, 1). Only Projection 1 is shown in full. The 
starred (*) entry indicates the primary bottleneck. Bottlenecks are progressively removed 
by adding devices. Thus each succeeding row of Projection 1 defines a limiting 
configuration in ascending order of capacity. E.g. 56 streams can be supported by a 
node with 3 disks, 2 processors, 1 SCSI bus and 1 ATM interface. For this 
configuration the SCSI bus is the bottleneck. Table 7 is concerned with streaming 
capacity. A complete picture also requires consideration of online storage capacity, 
which affects the number of disks required per node. The latter can only be analysed 
on a system-wide basis, see Section 8.2. 
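
The configuration analysis can be illustrated by taking, for each device type, the number of devices times the per-device streaming capacity, and identifying the smallest product as the bottleneck. The Python sketch below uses the Projection 1 capacities of Table 6; the names DEVICE_CAPACITY and node_capacity are our own, and the simple multiplication ignores the footnoted dependence of QoS limits on relative device loading. 

```python
# Streams per device for Projection 1 (Table 6).
DEVICE_CAPACITY = {"disk": 19, "CPU": 37, "SCSI": 56, "ATM i/f": 59}


def node_capacity(config):
    """Return (streams, bottleneck device) for a dict of device counts."""
    per_type = {device: count * DEVICE_CAPACITY[device] for device, count in config.items()}
    bottleneck = min(per_type, key=per_type.get)
    return per_type[bottleneck], bottleneck


# The example from the text: 3 disks, 2 processors, 1 SCSI bus, 1 ATM interface.
streams, bottleneck = node_capacity({"disk": 3, "CPU": 2, "SCSI": 1, "ATM i/f": 1})
```

For this configuration the sketch returns 56 streams with the SCSI bus as bottleneck, in agreement with the corresponding row of Table 7. 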

Such a table could be combined with cost information to find the best system to 
support a given load. A cost can be attached to each row of the table.⁶ The 
cost-effective node configuration is the row which gives the most streams per $. Suppose 
for example that for Proj1, this is the middle row. Then for a peak load requirement of 
say 1000 simultaneous users at 56 streams per node we would need about 18 nodes. 
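
The selection described above can be sketched in Python as follows. The stream counts are the Projection 1 rows of Table 7, but the cost figures are purely hypothetical (arbitrary units, chosen so that the middle row wins, as the text supposes); the names ROWS, most_cost_effective and nodes_needed are our own. 

```python
import math

# (streams per node, hypothetical node cost in arbitrary units).
ROWS = [(19, 10.0), (37, 14.0), (56, 18.0), (59, 22.0), (74, 27.0)]


def most_cost_effective(rows):
    """Return the row giving the most streams per unit cost."""
    return max(rows, key=lambda row: row[0] / row[1])


def nodes_needed(peak_streams, streams_per_node):
    """Whole nodes required to serve the peak load."""
    return math.ceil(peak_streams / streams_per_node)


best_streams, best_cost = most_cost_effective(ROWS)
nodes = nodes_needed(1000, best_streams)
```

With these assumed costs the 56-stream row is selected, and 1000 / 56 ≈ 17.9 rounds up to 18 nodes. 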



8 System-Level Scalability 

8.1 System Streaming Capacity 

Both the nCUBE3 and the Cluster architectures are based on a set of interconnected 
nodes used homogeneously. For a given node capacity, the system-level scaling 
function is linear. Following Section 3, we consider node capacity, step-size and 
limits. The relationship between these factors for the two architectures at the 
technology reference points is shown in Fig. 8 which is developed from Fig. 3(b). 

⁶ The number of disks will need to be modified to take account of online storage requirements. 



Table 7. Configuration table (cluster) 

                       Number of    Number of devices configured 
                       streams      disk     CPU    SCSI   ATM i/f 
Base1 - 2Mb/s          13 (13)      1 *      1      1      1 
                       56           5        2      2 *    1 
Projection 1 - 2Mb/s   19           1 *      1      1      1 
                       37           2        1 *    1      1 
                       56           3        2      1 *    1 
                       59           4        2      2      1 * 
                       74           4        2 *    2      2 
Projection 2 - 4Mb/s   14 (17)      1 *      1      1      1 
                       100 (120)    7 (8)    1 *    1      1 
                       117          9        2      1      1 * 


8.1.1 Node Capacity 

We consider here the capacity of the nodes used as video pumps. In Fig. 8 each curve 
shows how system capacity grows for nodes of a capacity given by the slope of the 
curve. The four⁷ cases shown correspond to Projections 1 and 2, for capacities of 60 
and 120 streams. For the cluster architecture, corresponding node configurations are 
obtained from Table 7. 

The vertical sections PQ, RS indicate the capacity range of 64 nodes with the two 
technology levels for either architecture. The horizontal sections ab, cd indicate the 
number of nodes required to service 10,000 streams at the two technology levels for 
the cluster architecture. 

For a given technology and node capacity, both architectures follow the same 
hypothetical curve. The analysis in [1] concludes that nCUBE3 is ATM-limited and 
this is borne out by published benchmark results⁸. Points P and P′ reflect the current 
capacity of nCUBE3 with 64 and 128 video pump nodes respectively. A 64-node 
nCUBE3 with 2001 technology would lie somewhere along the vertical intercept RS. 

8.1.2 Capacity Increments 

A fully connected hypercube has a fixed number of nodes. As the dimension increases 
by 1, the number of nodes is doubled. E.g. an increase of dimension from 6 to 7 
would yield an extra 64 nodes. If all the nodes are fully utilised by the video pumps, 
this corresponds to a doubling in capacity. Thus the increment size for a hypercube 
increases exponentially. A solution is to allow hypercubes of different dimensions to 
be combined. But nodes will have unequal numbers of neighbours, leading to less 
homogeneous routing. A Cluster architecture increments linearly in units of 1 node. 
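
The difference in step-size can be made concrete with a short calculation. In the Python sketch below the function names are our own; the only facts used are that a hypercube of dimension n has 2^n nodes, and that a cluster grows one node at a time. 

```python
def hypercube_nodes(dimension):
    """Number of nodes in a fully connected hypercube of the given dimension."""
    return 2 ** dimension


def hypercube_increment(dimension):
    """Extra nodes gained by going from `dimension` to `dimension + 1`."""
    return hypercube_nodes(dimension + 1) - hypercube_nodes(dimension)


# Step sizes for dimensions 0..7: each step equals the current size.
steps = [hypercube_increment(d) for d in range(8)]
```

Going from dimension 6 to 7 adds 64 nodes, as stated above, whereas the cluster increment stays at 1 node regardless of system size. 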

8.1.3 System Capacity Limits 

The upper and lower limits of the nCUBE3-architecture are determined by the 
smallest and largest hypercubes which are economically viable. For MediaCUBE, 
dimensions up to 8 (256 nodes) have been announced. 

We have not investigated quantitatively the resource usage of command 
processing, or of other system-level processes such as logging and admission control. 
In both architectures these functions are not handled by the video pump nodes. In 
nCUBE3, a supplementary subcube is added for this purpose⁹. In the cluster 
architecture the functions are handled by the command server which runs on a 
separate node. Since these ‘control nodes’ can be configured independently in either 
architecture, they are unlikely to constitute a practically significant limit to capacity. 
However, heavy interactive work places a greater load on the central Command 
Server application (in either architecture). In very large systems, this could cause 
sluggish response for transient periods, such as when many users are starting up after 
a break. We have not investigated this effect quantitatively. 



’ Only three curves appear because cases 2 and 3 coincide. 

* http://www.ncube.com/products/5kstream.html 

* E.g. 9 additional processors are required to support 128 nodes. 
8.2 On-Line Storage Requirements 

A separate scaling function is required to relate number of users to storage 
requirements. This is affected by the relative popularity of different titles. If a 
particular title is very popular, access to it may become a bottleneck in spite of 
adequate streaming capacity in the system as a whole. Two strategies are available to 
deal with this: striping, which causes each video film to be spread over many disks 
and/or nodes, and replication, which permits the more popular titles to exist in 
multiple copies. Each of these strategies has different costs and benefits [12]. A 
combination of the two may be used. 

The number of different titles accessed over a period may be termed the working- 
set. There is empirical evidence that title popularity tends to follow a Zipf distribution 
[9]. This assumption can be used to estimate the size of the working set associated 
with a given number of streams and also the number of streams accessing the most 
popular titles [1]. There must be enough secondary storage to accommodate the 
working-set together with any necessary replications. The storage scaling function 
must take account of these relationships, which are application-dependent. 
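
As a rough illustration of how the Zipf assumption feeds a storage scaling function, the sketch below estimates the expected working-set size and the expected load on the most popular title. It is not the method of [1]; it assumes a pure Zipf law (exponent 1 by default), a fixed catalogue size, and streams that choose titles independently, and all function names are hypothetical.

```python
def zipf_probabilities(num_titles: int, exponent: float = 1.0) -> list:
    """Popularity of each title by rank under a Zipf law (rank 1 = most popular)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_titles + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_working_set(num_titles: int, num_streams: int,
                         exponent: float = 1.0) -> float:
    """Expected number of distinct titles accessed by `num_streams` streams.

    Assumes every stream picks its title independently from the Zipf law,
    so title i is untouched with probability (1 - p_i) ** num_streams.
    """
    probs = zipf_probabilities(num_titles, exponent)
    return sum(1.0 - (1.0 - p) ** num_streams for p in probs)


def expected_streams_on_top_title(num_titles: int, num_streams: int,
                                  exponent: float = 1.0) -> float:
    """Expected number of streams reading the most popular title."""
    return num_streams * zipf_probabilities(num_titles, exponent)[0]
```

The first quantity bounds the secondary storage needed for the working set; the second indicates whether the most popular title needs striping or replication.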



9 Conclusion 

We have investigated the scalability of a prototype work-station cluster architecture 
for VoD applications and compared it with a commercially available MPP 
architecture, against a background of technology evolution. During the life of the 
project, radical changes took place in the MPP hypercube architecture, which have 
brought the two architectures closer together. Both are now based on a set of 
interconnected nodes which are used in a homogeneous way. 

Measurements based on workload emulation combined with static analysis turned 
out to be adequate basic tools for the task. Early information and improved 
confidence in quality-of-service limits were obtained by the complementary use of 
simulation. All of the numerical projections are of course subject to various estimates 
and assumptions identified in the analysis and must be treated with caution. 

Scalability of architectures presents new challenges for the performance engineer. 
The scaling functions are hierarchically structured, can involve several dimensions of 
capacity and size and are complicated in practice by the continuous evolution of 
technology and workload requirements. Given the uncertainties associated with these 
factors, we found that for the case investigated, a relatively wide parameter space 
could be explored using standard performance evaluation techniques. 
Abstract. Cost-effective clusters built from commodity-off-the-shelf 
components and connected with high-speed interconnection fabrics, to- 
gether with easy-to-use shared memory programming models, are cre- 
ating an attractive platform for parallel programming. However, these 
kinds of architectures currently lack monitoring environments that al- 
low the observation of performance data at various levels, detection of 
bottlenecks, and overall optimization of applications. 

This work presents a comprehensive approach attacking the problem by 
combining three basic building blocks: monitoring hardware for a state- 
of-the-art system area network (SCI), an innovative hybrid distributed 
shared memory system providing the base for any kind of shared memory 
programming models (SCI Virtual Memory or SCI-VM), and an exten- 
sible online monitoring system (OMIS/OCM). This forms the basis for 
an extensive tool environment on top of this emerging platform, which 
allows easy application porting, debugging, and performance tuning. 



1 Motivation 

With the rise of PC clusters based on high-speed networks like SCI |8I5|, Myrinet 
[3], or GigaNet [2^, the question of their programmability has become increasingly 
important. While the traditional approaches are based on message passing, the more 
comfortable shared memory model, previously restricted to tightly coupled systems 
like SMPs, is gaining acceptance in the form of Distributed Shared Memory (DSM) 
systems [IB]. They provide a single, global address space, allow inherent data 
distribution, and support incremental parallelization and therefore a smooth 
migration path for the parallelization of an application. This ease of use, 
however, comes at a price. Because the shared memory layer transparently hides the 
physically distributed memory resources, the optimization process for such systems 
is more complex. In particular, the most important performance issue of DSM 
systems, the spatial and temporal data locality, cannot be observed directly. New 
concepts for the detection, acquisition, and evaluation of memory access patterns 
responsible for bad locality have to be developed. 

The SMiLE project (Shared Memory in a LAN-like Environment) targets 
these research issues with both hardware and software efforts for SCI based 



B.R. Haverkort et al. (Eds.): TOOLS 2000, LNCS 1786, pp. 294-308, 2000. 
clusters. SCI, the Scalable Coherent Interface |8I5|, is an IEEE-standardized, 
state-of-the-art interconnection fabric with link speeds of up to 6.4 Gbit/s in the 
newest generation. The fast communication is provided in a VI Architecture [4] 
style through a global SCI address space in a hardware DSM fashion at user level, 
allowing for direct remote memory access. This mechanism provides the base for 
the extensive SMiLE software infrastructure [^, enabling both message passing and 
shared memory programming. The latter in particular benefits directly from the 
existence of hardware DSM mechanisms, which allow the construction of a hybrid 
hardware/software DSM system instead of relying completely on software efforts. 
On the hardware side, this framework is completed by a monitoring system, which 
enables the user to observe any memory transaction at the SCI HW-DSM level. 

Together with the OMIS project [1^ (Online Monitoring Interface Specifi- 
cation), which aims at the design and the definition of an open and extensible 
tool interface for distributed systems, a monitoring concept for shared memory 
programming models has been developed based on the SMiLE monitor. This 
concept takes the multi-layer architecture of such systems into account and 
mirrors it in the monitoring software. It can be implemented as an extension 
to OMIS without requiring major changes, allowing instant access for new and 
existing tools to this programming platform. It therefore provides a powerful, 
extensible, and easy-to-use tool case for the debugging and the optimization of 
shared memory applications. 

The remainder of this paper is organized as follows. Section 2 introduces the 
hybrid software/hardware DSM system, the SCI Virtual Memory, as the base 
for the monitoring efforts, followed by a description of the SMiLE hardware 
monitor and its driver infrastructure in Section 3. Section 4 then presents the 
layer structure used for the data acquisition. In Section 5, the on-line monitoring 
framework OMIS deployed in this work is introduced, followed by a discussion of 
how the framework can be extended to monitor hybrid DSM systems in Section 6. 
The paper is rounded off by presenting the current status of the work in 
Section 7, a brief overview of the related work in Section 8, and some concluding 
remarks in Section 9. 



2 Shared Memory on Clusters and Its Problems 



The main prerequisite for shared memory programming on clusters is a fully 
transparent, global memory providing applications with the abstraction of a 
cluster- wide global process and a single virtual address space. Within the SMiLE 
project, such a layer, the SCI Virtual Memory or SCI-VM m has been devel- 
oped. It not only provides the required abstractions, but also forms the basis for 
any kind of shared memory programming model on top of SCI-based PC clusters. 
It is therefore the key component to open the cluster architecture to the 
shared memory programming paradigm which has been traditionally the domain 
of tightly coupled parallel systems like SMPs. 
In order to most efficiently utilize the given HW-DSM features of the SCI 
interconnection fabric, the SCI-VM layer is built as a hybrid hardware / software 
system. While the management and configuration tasks are carried out in soft- 
ware in a DSM-like fashion, the actual communication is handled in hardware by 
SCI’s HW-DSM mechanism. The memory is distributed at page granularity and 
these distributed pages are then combined into a global virtual address space. 
This is achieved by mapping local pages conventionally through the processor’s 
memory management unit (MMU) and remote pages through SCI’s HW-DSM 
mappings. There is no need for any page replication or special multiple writer 
protocols. This eliminates many of the typical SW-DSM bottlenecks like false 
sharing and complex differential page update protocols. 

One typical DSM problem, however, remains: the need for good temporal 
and spatial locality. Due to the NUMA characteristics of such an SCI cluster, 
any access to remote memory has a latency overhead of about one to two orders 
of magnitude compared to accesses to local memory. For good overall application 
performance, it is therefore mandatory for the application to exhibit good 
locality properties. Unlike in message passing systems, however, where any 
communication is done explicitly at a priori known locations, shared memory 
systems are based on implicit communication triggered by processor reads and 
writes. This communication is therefore not predictable and can only be monitored 
with the help of specialized hardware, like the SMiLE monitor described in detail 
below, capable of snooping transactions on the interconnection network. With the 
help of the information acquired with such systems, and after its transformation 
to a higher level of abstraction equal to that of the original application source, 
it is then possible to incrementally optimize the application’s behavior statically 
(through different assignments of data and tasks to nodes) and dynamically 
(through run-time mechanisms like data and thread migration). In both cases, 
however, the impact on the application source code itself is minimal, requiring 
only the tuning of a few parameters. This maintains the comfortable shared memory 
programming model for the user while potentially resulting in significantly higher 
performance. 
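
The weight of the locality argument can be seen from a simple weighted-average model of memory access latency. The sketch below is an illustration only: it assumes that every access is either purely local or purely remote, ignores caching and contention, and uses hypothetical parameter names.

```python
def mean_access_latency(remote_fraction: float,
                        local_latency: float,
                        remote_factor: float) -> float:
    """Average latency per access in a two-level (local/remote) NUMA model.

    remote_fraction: share of accesses that go to remote memory (0..1)
    local_latency:   latency of a local access, in any time unit
    remote_factor:   how many times slower a remote access is; the text
                     suggests a range of roughly 10 to 100
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must lie between 0 and 1")
    remote_latency = local_latency * remote_factor
    return ((1.0 - remote_fraction) * local_latency
            + remote_fraction * remote_latency)
```

With a remote access 100 times slower than a local one, just 5% remote accesses already raise the mean latency to 5.95 times the local latency, which is why small improvements in locality can pay off substantially.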



3 The SMiLE Monitoring System 

The architecture for the SCI-VM consists of PCs, each equipped with a PCI-to-SCI 
bridge which plugs into the PCI bus and translates PCI requests into SCI 
packets. In the SMiLE system deployed in this work, this basic architecture is 
augmented by a specialized hardware monitor, depicted in Figure 1, which is 
capable of observing any transaction on this bridge. 

The PCI-SCI adapter, described in detail in [T], serves as the interface between 
the PC’s I/O bus and the SCI interconnection network^. It intercepts standard 
processor-to-memory operations on memory mapped I/O address ranges, 

^ We currently use our own PCI-SCI adapter to have more control over implementation 
details. Commercial adapters are available and are built based on the same design 
principles. The system presented here can and will in the future also be applied to 
any SCI networking hardware. 
generates packets for remote SCI nodes and transparently forwards them to the 
SCI network. Vice versa, incoming packets arriving from the SCI network are 
translated into PCI transactions for the local node. 




Fig. 1. The SMiLE SCI node architecture: a PCI-SCI adapter and the 
hardware monitor card installed in the PC’s PCI local bus. 



As also shown in Figure 1, the PCI-SCI adapter is divided into three logical 
parts: the PCI unit, the Dual-Ported RAM (DPR), and the SCI unit. The 
non-SCI-link side of this chip is the B-Link, a 64-bit wide synchronous bus 
connecting the SCI unit and the DPR. It carries all incoming and outgoing packets 
for the physical SCI interface and is therefore the central serialization point at 
which all remote memory traffic can be monitored. The information that can be 
recorded from the B-Link includes the transaction command, the target and the 
source node IDs, the node-internal address offset, and the packet data. The SMiLE 
monitor therefore uses this link to hook into the PCI to SCI bridge. 

The flexible architecture of the hardware monitor allows the programmer 
to utilize it in two performance-analysis working modes: the dynamic and the 
static mode. The dynamic working mode allows users to monitor the runtime and 
communication behavior within the whole SCI address space. It is suitable for 
delivering detailed information to tools for performance evaluation and tuning 
without prior application-specific information. In order to be able to record all 
data of interest with only limited hardware resources, the monitor exploits the 
spatial and temporal locality of data and instruction accesses in a similar way 
as cache memories do. The hardware monitor contains a content-addressable 
counter array managing a small working set of the most recently referenced 
memory regions. If a memory reference matches a tag in the counter array, the 
Fig. 2. The monitor’s dynamic working mode principle. 



associated counter is incremented. If no matching tag is found, a new counter-tag 
pair is allocated and initialized to 1. If no more space is available within the 
counter array, either counters for neighboring address areas are merged first, or 
a counter-tag pair is flushed to a larger ring buffer in main memory. This buffer 
is supposed to be emptied by the system software in a cyclic fashion. In the case 
of a ring buffer overflow, a signal is sent to the software process urging it to 
retrieve the ring buffer’s data. Figure 2 demonstrates the working principle of the 
hardware monitor’s dynamic mode. 

In order to reduce the amount of flushing and re-allocating of counter-tag pairs, 
it is advantageous to integrate the strategy of counter coverage adaptation into 
the cache replacement mechanism. Provided that counter tags can name not only 
single addresses but also contiguous areas of memory for which the counter 
represents the number of accesses, a simple least-recently-used replacement 
algorithm can be adapted to this special task. The maximum address range, 
however, has to be predefined by the user. 
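
The dynamic mode described above can be mimicked in software to make the interplay of tag matching, merging, and flushing explicit. The following is a minimal behavioural sketch, not the hardware design: the class and attribute names are hypothetical, the merge policy (merge the two address-adjacent counters with the smallest combined range) is an assumption, and a ring buffer overflow is modelled simply as a flag that is raised once the buffer is full.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CounterEntry:
    low: int    # first address covered by the tag
    high: int   # last address covered by the tag
    count: int  # references observed within [low, high]


class DynamicCounterArray:
    def __init__(self, capacity: int, max_range: int, ring_capacity: int):
        self.capacity = capacity            # number of counter-tag pairs
        self.max_range = max_range          # user-defined maximum tag coverage
        self.ring_capacity = ring_capacity
        self.entries = []                   # index 0 = least recently used
        self.ring = deque()                 # stands in for the main-memory ring buffer
        self.overflow_signalled = False

    def _hit(self, address: int) -> bool:
        for i, entry in enumerate(self.entries):
            if entry.low <= address <= entry.high:
                entry.count += 1
                self.entries.append(self.entries.pop(i))  # mark as most recent
                return True
        return False

    def _make_room(self) -> None:
        # First try to merge two neighbouring address areas (assumed policy).
        by_address = sorted(self.entries, key=lambda e: e.low)
        best = None
        for left, right in zip(by_address, by_address[1:]):
            span = right.high - left.low + 1
            if span <= self.max_range and (best is None or span < best[0]):
                best = (span, left, right)
        if best is not None:
            _, left, right = best
            left.high = right.high
            left.count += right.count
            self.entries.remove(right)
            return
        # Otherwise flush the least recently used pair to the ring buffer.
        self.ring.append(self.entries.pop(0))
        if len(self.ring) >= self.ring_capacity:
            self.overflow_signalled = True  # software should drain now

    def record(self, address: int) -> None:
        if self._hit(address):
            return
        if len(self.entries) >= self.capacity:
            self._make_room()
            if self._hit(address):          # a merge may now cover the address
                return
        self.entries.append(CounterEntry(address, address, 1))

    def drain(self) -> list:
        """Emulates the system software emptying the ring buffer."""
        flushed = list(self.ring)
        self.ring.clear()
        self.overflow_signalled = False
        return flushed
```

Feeding the array a reference stream with good locality keeps most updates inside the counter array; a stream with poor locality shows up as frequent flushes to the ring buffer.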

The static working mode allows users to explicitly program the hardware 
for event triggering and action processing on freely definable SCI regions. This 
addition to the monitor design was chosen to allow the user to include previously 
known memory areas that are to be monitored in any case, much like in 
conventional histogram monitors m. Figure 3 shows a simplified view of the 
hardware structure used to realize this feature. 

All incoming transactions are handled through an event filter which comprises 
a page table and an event station. In combination, both implement the ability 


Multilayer Online-Monitoring for Hybrid DSM Systems 299 



[Figure 3: block diagram of the SMiLE hardware monitor attached to the PCI/SCI 
adapter. Labels recoverable from the figure: SCI address space (pages j, j+1, 
j+2 of node i); page table (nodenum, pagenum, state); event station (page frame, 
top, bottom, transaction type); active event, readsb, writesb, locksb, request, 
response, in/out, codex; SCI transaction and buffer management; event counting 
(performance analysis mode) and codex mode; static counter array (counter 0 to 
counter k-1) and event selector (enable, disable, count).] 

Fig. 3. The monitor’s static mode hardware structures. 



to monitor memory regions and particular transactions upon them. The event 
stations specify the exact events on which the monitor triggers. An entry within 
this hardware structure points to a page descriptor within the page table. The 
bottom and the top address fields specify the range within the indexed page 
while the transaction type describes the relevant operation to that page. 

For the performance analysis mode, each counter within the static counter 
array is associated with three possible events selected from the event station. As 
a counter comprises the three actions of being enabled, being triggered to count, 
and being disabled, three distinct events can be specified. 
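
The static mode can likewise be sketched as a small software model of the event filter. The code below is an assumption-laden illustration: the class names, the string-valued transaction types, and the order in which enable, count, and disable are evaluated for a single transaction are not taken from the hardware description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageDescriptor:
    node: int          # SCI node number
    page: int          # page number on that node


@dataclass(frozen=True)
class Event:
    page_index: int    # index of the page descriptor in the page table
    bottom: int        # lowest offset within the page that triggers
    top: int           # highest offset within the page that triggers
    transaction: str   # e.g. "read", "write", "lock" (illustrative values)


@dataclass
class StaticCounter:
    enable_event: int  # indices into the event station
    count_event: int
    disable_event: int
    enabled: bool = False
    value: int = 0


class EventFilter:
    def __init__(self, page_table, event_station, counters):
        self.page_table = page_table
        self.event_station = event_station
        self.counters = counters

    def matching_events(self, node, page, offset, transaction):
        """Indices of all event station entries triggered by a transaction."""
        fired = set()
        for index, event in enumerate(self.event_station):
            descriptor = self.page_table[event.page_index]
            if (descriptor.node == node and descriptor.page == page
                    and event.bottom <= offset <= event.top
                    and event.transaction == transaction):
                fired.add(index)
        return fired

    def observe(self, node, page, offset, transaction):
        fired = self.matching_events(node, page, offset, transaction)
        for counter in self.counters:
            if counter.enable_event in fired:
                counter.enabled = True
            if counter.enabled and counter.count_event in fired:
                counter.value += 1
            if counter.disable_event in fired:
                counter.enabled = False
```

A counter configured this way could, for instance, count writes to a data region only between a lock acquisition and a later read of the same lock word.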



4 Layers of Data Acquisition 

The data delivered by the hardware monitoring system alone, however, does 
not suffice for the evaluation of applications. It is too low-level and lacks the 
information needed to bridge the semantic gap between the low-level hardware 
events observed in the running system and the application layer visible to the 
user. To overcome this shortcoming, a monitoring system has to utilize the 
layered implementation structure of the overall system. Each of these layers 
contributes information to the monitoring system, resulting in a flexible 
hierarchical structure which enables the transformation of acquired low-level 
information into a user-readable form at application level. The following 
section introduces these layers together with the information they provide and 
how they can be applied for the required transformation. 

4.1 Hardware Monitoring 

The lowest layer in the system is represented by the SMiLE hardware monitor 
and its low-level driver software as it has been described above. The information 
delivered is very detailed, low-level, and purely based on physical address infor- 
mation [S]. Although this layer is the main data source for the overall monitoring 
system, these properties of the delivered data prohibit its direct evaluation at the 
application layer as well as the parameterization of its programmable logic at 
runtime. 

4.2 Status Information from the Hybrid HW/SW-DSM System 

The main information missing for an evaluation of the data acquired from the 
SMiLE hardware monitor is the translation of physical addresses seen by the 
monitor into virtual addresses visible from the application processes. In a system 
with a global virtual memory abstraction, as discussed in this work, this 
translation is controlled by the DSM layer, in this case the SCI-VM. It extends 
the virtual memory management of the operating system to a cross-node mem- 
ory resource control and is responsible for setting up the correct virtual address 
mappings to both local and remote memory. It therefore is able to directly pro- 
vide the required information to the monitor system. 

The mapping information, however, is highly dynamic, as mappings are established 
on demand according to the memory access pattern of the application. Due to the 
limited resources in physical memory and mapping entries on SCI cards, they 
are also subject to swap-like eviction processes. To compensate for this, the 
monitor has to be capable of receiving asynchronous updates. These will be sent 
whenever mapping information changes due to one of these dynamic mechanisms, 
guaranteeing that the information in the monitor is always correct. 
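
A translation table that is kept current by such asynchronous updates might look as follows. This is a hedged sketch rather than the SCI-VM interface: the page size, the class name, and the `on_map`/`on_unmap` callbacks are assumptions chosen for illustration.

```python
from typing import Optional

PAGE_SIZE = 4096  # assumed page size in bytes


class MappingTable:
    """Physical-to-virtual page mappings as seen by the monitoring software."""

    def __init__(self):
        # (node, physical page number) -> virtual page number
        self._virtual_page = {}

    def on_map(self, node: int, phys_page: int, virt_page: int) -> None:
        """Asynchronous update: the DSM layer established a mapping."""
        self._virtual_page[(node, phys_page)] = virt_page

    def on_unmap(self, node: int, phys_page: int) -> None:
        """Asynchronous update: a mapping was evicted or invalidated."""
        self._virtual_page.pop((node, phys_page), None)

    def to_virtual(self, node: int, phys_addr: int) -> Optional[int]:
        """Translate a physical address observed by the monitor.

        Returns None if no mapping is currently known for the page.
        """
        phys_page, offset = divmod(phys_addr, PAGE_SIZE)
        virt_page = self._virtual_page.get((node, phys_page))
        if virt_page is None:
            return None
        return virt_page * PAGE_SIZE + offset
```

In this model a monitor record that arrives after its mapping was evicted simply fails to translate, which is one reason the updates must reach the monitor promptly.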

Besides the pure mapping information, the DSM layer can provide further 
information that can aid in performance evaluations of shared memory applica- 
tions. It can be extracted from statistics collected during the dynamic events of 
the SCI-VM and includes data about mapping invalidations, mapping misses, 
and access history. This forms the basis for an in-depth evaluation of the 
application’s memory access patterns and therefore provides an important con- 
tribution to the monitoring system. 

4.3 Instrumenting the Synchronization Primitives 

While the two layers described above cover all information required for the mon- 
itoring and the evaluation of the cluster-wide shared memory subsystem, they 
lack the ability to monitor any kind of synchronization primitives. Those, how- 
ever, generally have a critical performance impact on shared memory applica- 
tions. In order to enable the user of the monitoring system to observe their 
impact, the SyncMod module containing generic synchronization constructs like 
locks, barriers, and counters delivers various statistical information to the mon- 
itor. This ranges from simple lock and unlock counters, to more sophisticated 
information like mean and peak lock times, mean and peak barrier wait times, 
and average and maximal queue size for resource acquiring operations. This in- 
formation will allow the detection of bottlenecks in applications and forms the 
base for an efficient application optimization. 
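
To illustrate the kind of instrumentation meant here, the sketch below wraps a standard Python lock and collects lock and unlock counts together with mean and peak hold times. It is not the SyncMod implementation; the class name is hypothetical, "lock time" is interpreted as the time a lock is held, and the injectable clock exists only to make the example testable.

```python
import threading
import time


class InstrumentedLock:
    """A lock that records simple usage statistics for a monitor."""

    def __init__(self, clock=time.perf_counter):
        self._lock = threading.Lock()
        self._clock = clock
        self._acquired_at = None
        self.lock_count = 0
        self.unlock_count = 0
        self.total_hold = 0.0
        self.peak_hold = 0.0

    def acquire(self) -> None:
        self._lock.acquire()
        # Statistics are updated while the lock is held, so they are
        # protected by the lock itself.
        self._acquired_at = self._clock()
        self.lock_count += 1

    def release(self) -> None:
        held = self._clock() - self._acquired_at
        self.total_hold += held
        self.peak_hold = max(self.peak_hold, held)
        self.unlock_count += 1
        self._lock.release()

    @property
    def mean_hold(self) -> float:
        if self.unlock_count == 0:
            return 0.0
        return self.total_hold / self.unlock_count

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.release()
        return False
```

A lock with a high peak hold time relative to its mean is a typical candidate for closer inspection when searching for synchronization bottlenecks.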



4.4 Programming Model and Environment Specific Information 

So far only the monitoring of raw shared memory applications is covered. This 
is done at the level of the SCI-VM. This layer, however, is generally not used 
by programmers for application development, as it is too low-level and generic. 
As described in Section 2, it is merely a base for the development of arbitrary 
shared memory programming models. These programming models add func- 
tionality, normally for easier configuration and higher data abstraction, to the 
SCI-VM while at the same time restricting the general features of the SCI-VM. 
The result is a programming abstraction that is customized to the user’s de- 
mands and preferences. Programming models implemented in this style can be 
standard APIs ranging from distributed thread models to DSM APIs like the 
one of TreadMarks [2], but also form the basis for the implementation of more 
complex programming environments like CORBA, COOP^, and OpenMP. 

In order to cope with the different abstractions provided by the different 
programming models, the monitoring infrastructure has to be extensible in the 
sense that it can be adapted to relate the information acquired by the lower layers 
to the application’s level of abstraction. This enables the monitoring system to 
be adapted to any kind of programming model, providing specific information to 
the user at the level of the application instead of only showing monitoring results 
at a generic global memory level. This will simplify the user’s task greatly and 
provide an adaptable, easy-to-use monitoring solution. 



4.5 Unifying to a Single Abstract Model 

Figure 4 outlines the layered architecture of the SMiLE DSM system and its 
monitoring components. As described, the lower three layers consist of the SMiLE 
hardware, the SCI-VM virtual memory system, and the synchronization module, 
SyncMod. These lower three layers exist unmodified in, and independently of, 
any higher level programming model. Various higher level programming models 
like globally distributed threads or hardware assisted TreadMarks-like shared 
memory can be implemented on top of them. In the case of very high level 
distributed programming environments like Split-C, parallel C++, or distributed 
CORBA objects, another layer can be added above the programming model. 

^ Concurrent Object Oriented Programming 
Fig. 4. Multilayer DSM architecture with SMiLE 



To obtain useful performance data in this environment, monitored data from 
the different levels of abstraction have to be combined. The gathering of mon- 
itoring data follows the same layer model as described above. At the bottom 
is the SMiLE monitor. It observes SCI transactions and gathers statistical and 
performance data at the level of physical addresses and physical node numbers 
of the PC cluster. This data has to be translated according to information from 
the SCI-VM layer to obtain performance data for virtual cluster global addresses 
and address ranges. In addition, statistical information about address transla- 
tions is gathered from the SCI-VM layer. For the SyncMod layer, information 
about locks, barriers, etc. has to be related to data from the lower level layers. 
Finally, the resulting data have to be mapped to the abstraction level of the 
user’s programming model which is represented by the highest layers. 

If a complex layered model like the above is to be monitored, monitoring 
data have to be gathered at various levels and combined to give a single global 
image. Similarly, configuration information for the SMiLE hardware monitor 
has to be translated, only in the opposite order. The decision about which parts 
of the application are most interesting to monitor in detail is often made at the 
abstraction level of the application. The corresponding information then has to 
be translated into low-level information like hardware addresses, layer by layer. 
This is not possible with simple traditional monitoring approaches. However, the 
OMIS/OCM monitoring system developed at LRR-TUM provided an excellent 
basis that could easily be adapted to monitor our SMiLE environment. 
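
The top-down direction, from an application-level object to a hardware monitor configuration, can be sketched as a chain of two lookups. Everything in the example is an assumption made for illustration: the symbol table format, the page map, the page size, and the shape of the resulting configuration tuples.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def region_to_monitor_config(symbols, page_map, name):
    """Translate a named application object into per-page monitor ranges.

    symbols:  name -> (virtual base address, length in bytes)
    page_map: virtual page number -> (node, physical page number)
    Returns a list of (node, physical page, bottom offset, top offset).
    """
    base, length = symbols[name]
    end = base + length
    config = []
    address = base
    while address < end:
        virt_page, offset = divmod(address, PAGE_SIZE)
        chunk_end = min(end, (virt_page + 1) * PAGE_SIZE)
        node, phys_page = page_map[virt_page]
        top = offset + (chunk_end - address) - 1
        config.append((node, phys_page, offset, top))
        address = chunk_end
    return config
```

An object that straddles a page boundary yields one entry per page, each of which could then be loaded into a page table and event station entry of the static mode.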

5 On-Line Monitoring with OMIS 

OMIS (On-Line Monitoring Interface Specification) [1^ is the specification of 
an interface between a programmable on-line^ monitoring system for distributed 

^ On-line monitoring, in contrast to off-line monitoring, allows tools to come into 
operation while the application of interest is running, and allows them not only to 
observe but also to manipulate the running target system. 
computing and the tools that sit on top of it. Because OMIS decouples the 
implementation of the tools and of the monitoring system, it can improve the 
portability and availability of tools on different platforms. 

OMIS has also been implemented as an OCM (OMIS Compliant Monitor) at 
LRR-TUM. OMIS and OCM are designed to be powerful and general enough to 
support a wide variety of tools, from performance analyzers and debuggers to 
load balancers and visualizers. To support this versatility, and in order to be able 
to support newer tools and different programming environments, the interface 
as well as the monitor are extensible. OMIS comprises a core of services that are 
independent from the underlying platform and programming environment plus 
optional extensions that cover platform or programming environment specific 
services or special services for a specific tool. 

Extensions can introduce new objects, services and knowledge about seman- 
tics into the monitoring system. These objects and services are then accessible 
to the tools via the standard OMIS monitoring interface. The first extension 
that has been implemented was for the PVM distributed message passing en- 
vironment. Other extensions offer special performance analysis tasks, tracing 
functionality, source level support, checkpoint/restart, and several more. DSM 
and CORBA extensions are the subject of ongoing research. 
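
The core-plus-extensions structure can be pictured as a service registry into which extensions plug additional services. The sketch below is purely illustrative and does not reproduce the OMIS service syntax or the OCM implementation; the class, method, and service names are all hypothetical.

```python
class MonitorCore:
    """A toy monitor core with platform-independent services (hypothetical)."""

    def __init__(self):
        # Core services are available on every platform.
        self._services = {"node_list": lambda: ["node0"]}

    def register_extension(self, prefix: str, services: dict) -> None:
        """Add an extension's services under a common name prefix."""
        for name, function in services.items():
            key = f"{prefix}_{name}"
            if key in self._services:
                raise ValueError(f"service {key!r} already registered")
            self._services[key] = function

    def request(self, service: str, *args):
        """Single entry point used by tools for core and extension services."""
        return self._services[service](*args)
```

A tool written against `request` does not need to know whether a service comes from the core or from an extension, which mirrors the decoupling argued for above.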

6 Bringing It All Together 

OMIS’ and OCM’s design and versatile structure and the SMiLE DSM layered 
architecture are an almost perfect match when it comes to on-line monitoring. 

The OMIS core and its implementation in OCM are completely independent 
of the programming environment. They are always the same, regardless of 
whether a message passing environment like PVM has to be monitored or 
a shared memory environment like the one discussed here. Since the core only has 
generic interfaces to node local resources, it can be directly used without change. 
Programming environment dependent functionality is provided by means of 
extensions. In addition, extensions can provide functionality for specific tools, 
such as the DETOP extension that introduces source level single stepping into the 
monitoring system and that is used by the OMIS/OCM version of the DETOP 
m debugger. 

Extensions are also used to monitor our SMiLE system. The basic monitoring 
architecture and the corresponding extensions of our SCI-DSM monitor system 
are depicted in Figure 5. 

The OMIS/OCM core covers all generic node local resources. One monitor 
extension, the OMIS SCI-DSM extension, is responsible for monitoring and con- 
trolling the lower three layers of the system. It uses interfaces provided by the 
OMIS/OCM core plus interfaces from the SMiLE monitor drivers, the SCI-VM, 
and the SyncMod layers. Statistical and performance data are obtained from 
the SMiLE hardware monitor. As a platform specific extension the SCI-DSM 
extension includes knowledge about the special semantics and utilizes memory 
mapping information from the SCI-VM layer to map this data from physical to 
Fig. 5. Multilayer SMiLE DSM monitoring with OMIS/OCM 



logical cluster addresses. In addition, it gathers statistical data about address 
translations. Similarly, data obtained from the SyncMod layer is related to the 
low level performance and statistical data from the hardware monitor. Thereby, 
semantically higher level and more useful performance data is computed. 

The resulting higher level information can then be obtained from the OMIS/OCM 
monitor via standard monitor services that are implemented by the SCI-DSM 
extension. The interface is identical to the one described in [13]. 

To configure the SMiLE hardware monitor, information can also be downloaded 
from a tool via the standard OMIS monitoring interface, in the same way as is 
done, for example, with source line information obtained from the symbol table 
in the DETOP debugger extension. This information can be specified at the higher 
level and is automatically transformed to its low level specification and sent to 
the SMiLE hardware monitor by the OMIS SCI-DSM extension. 

To monitor the higher level layers, like distributed threads or parallel C++, 
additional monitor extensions can then be brought into OMIS/OCM. These utilize 
the already present services from the underlying OMIS SCI-DSM extension 
by calling them via functionality provided by the OMIS/OCM core. These 
extensions can provide abstractions like cluster global processes, distributed 
threads, or C++ objects. 

7 Current Status and Future Work 

The overall SMiLE/OMIS monitoring system is still part of active research and 
is therefore not yet completely implemented. Most of the individual parts, 
however, exist or are in an advanced prototype stage. 
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The SCI-VM system exists in a static version, i.e. without dynamic remap- 
ping in case of resource limitations, for Windows NT and Linux. First experi- 
ments with the system show the benefits of such a shared memory environment 
on top of a SCI based cluster, but also exhibit the vulnerability to bad locality. 
Some large scale application tests have already been started (see |2^), but will 
be intensified in the near future when the full implementation is available. The
SCI-VM is already today augmented by a complete SyncMod implementation
providing the necessary synchronization primitives for shared memory program-
ming. On top of the SCI-VM infrastructure several programming models already 
exist, most notably a SPMD programming model, the TreadMarks API [2], and 
a subset of the Win32 thread API [16].

The design of the SMiLE monitor is complete and the hardware development 
is currently in its last stage. However, extensive simulations using a software SCI
cluster simulator have been undertaken [^. They show the drastic performance im-
provements that can be gained by using a monitor driven optimization approach.
In these simulations, the transformation of the low-level data into the higher
level has been done with an application specific prototype of the OMIS/DSM
extension, however, without the strict hierarchical structure introduced in this
work.

The OMIS/OCM core monitor is fully operational and running under SUN
Microsystems’ Solaris, Silicon Graphics’ IRIX, IBM’s AIX, and the Linux op-
erating system. Several extensions, e.g. the PVM extension, already exist.
An experimental CORBA monitoring system has also been implemented. The
SCI-DSM extension, however, is currently still under development. Higher level
extensions for distributed threads, TreadMarks, SPMD, OpenMP, or parallel
C++ are only in a conceptual phase.

On top of OMIS/OCM, various tools that are part of The ToolSet project 
1^ exist and have been ported to OCM: a performance analyzer (PATOP), a de-
bugger (DETOP), a visualizer (VISTOP), and a checkpointing tool (CoCheck).
A deterministic execution tool (Codex) is also in preparation. Currently, these 
tools only exist for message passing environments. However, once the SCI-DSM 
version of OMIS/OCM is available and provides suitable abstractions, many of 
the tools can be relatively quickly ported to DSM. In addition, future OMIS 
compliant DSM tools can be developed independently from the underlying com-
munication system and can, e.g., be used with software shared memory systems
like standard TreadMarks as well as with TreadMarks on top of SCI-DSM. 

8 Related Work 

Hardware Monitoring in Clusters 

Hardware monitoring support in general is currently mostly restricted to hard- 
ware counters incorporated into the individual CPUs. These counters have the 
ability to collect information about data accesses, cache misses, and TLB misses. 
In addition, traffic monitors in network adapters and switches deliver an
overview of the total traffic. More sophisticated, detailed, and global per-
formance monitoring facilities are currently, to our knowledge, not commercially
available.

Only a very small number of research projects target the increasingly impor-
tant cluster architecture. Manzke et al. [14] present monitoring hardware, also
for SCI, which allows deep traces of SCI traffic, however, without the option of on-
line address range selection and histograms offered by the SMiLE monitor. Martonosi
et al. [1^ propose a monitoring approach for the SHRIMP multiprocessor based
on Myrinet [3]. However, this approach is implemented in software only as an 
additional Myrinet Control Program (MCP) run by the LANai processor and 
not directly in hardware m- 



DSM Systems 

Work on shared memory models for clusters of PCs is mostly done with the help 
of pure software DSM systems like TreadMarks [2] and Brazos |2^. Only very
little work has been done on direct utilization of HW-DSM for global memory 
abstractions in clusters. An example of this kind of work, which is also based on
SCI, is SciOS El. a system designed for swapping into remote memory regions 
while providing System-V style shared memory. 



Monitoring Infrastructures 

Traditionally, on-line monitoring infrastructures have been tightly coupled with 
the corresponding tools. There are only a few generally applicable monitoring sys-
tems for distributed systems. In the area of performance analysis, Paradyn [m
with its dynamic instrumentation library, dyninst m is of great importance. 
Dyninst is also used by DPCL (Dynamic Probe Class Library) |2Dj, a distributed 
monitoring system developed by the IBM Corporation. 

However, today these monitoring systems do not target DSM environments.
An extension concept like the one in OMIS/OCM does not exist. Therefore, they
cannot be adapted as easily.

9 Conclusions 

In this paper, a comprehensive and extensible monitoring infrastructure for a 
hybrid hardware / software DSM environment on top of an SCI based cluster of 
PCs has been presented. It is based on the already existing on-line monitoring 
system OMIS/OCM and extends it to cope with the various layers of abstrac- 
tion in a DSM system. Performance data gathered by a configurable hardware 
monitor at a rather low level is mapped onto higher abstraction levels to be 
most helpful to the user. In addition, several other sources of performance in- 
formation, like from synchronization mechanisms, are being integrated into the 
monitoring system. This results in a powerful base for on-line monitoring tool
sets allowing an easy and thorough evaluation and optimization of applications.

By choosing a standardized monitoring platform like OMIS, the emerging
DSM platform can profit from already existing tools as well as from future
tool developments on any OMIS compliant platform. This guarantees a rich
tool environment for the SMiLE architecture and is therefore one of the major
foundations for the success of DSM systems beyond pure use in research.



Abstract. E-commerce applications are growing at unprecedented rates,
resulting in overloaded sites with poor quality of service. The workload
intensity of an e-commerce site is not totally predictable given that exter- 
nal events can generate load spikes that exceed by far the average load. 
Therefore, e-commerce site managers need to be able to understand the 
performance of the site and be able to tune it to cope with varying traffic 
patterns. In this paper we present PROFIT, a new tool for profiling the 
performance of e-commerce sites. PROFIT measures both throughput 
and response time and breaks down the response time in terms of com- 
ponents (e.g., Web server, application server, and database server) and 
services (e.g., search, browse, select, add to cart, and pay). To illustrate 
the use of the tool, the paper shows an analysis of performance and se- 
curity in e-commerce applications, measuring the impact of the Secure 
Sockets Layer (SSL) protocol on the request response time. 



1 Introduction 

E-commerce is one of the most important applications on the Internet and at- 
tracts a very large number of users. It is not uncommon for e-commerce sites 
to see load spikes of 6 to 10 times the average. The increase in workload inten- 
sity may occur due to the natural growth rate of e-commerce activities combined 
with events such as marketing campaigns and launching of new products and ser- 
vices. These surges usually saturate servers and networking resources, cause site 
outages, and drive away customers who face long response times. E-commerce 
sites are significantly different from traditional web servers, both in the nature
of the workload and their architecture. The workload of e-commerce sites has
to be characterized in terms of sessions, which are sequences of interrelated re-
quests [6]. In order to improve performance and availability, the architecture of
e-commerce sites is usually organized in multiple layers of software components 
and servers. This implies that an e-commerce transaction is executed by many 
components, such as web servers, application servers, and database servers [2]. 
The processing costs at each of these components vary widely according to the 
functions executed. Also, the need for security in e-commerce transactions adds 
an extra load to e-commerce sites. Security protocols can consume a significant
amount of processing resources [J. Transactions may experience congestion at
each of the various servers as they queue for both hardware resources (e.g., pro-
cessors and disks) and software resources (e.g., a slot in a TCP listen queue or
an HTTP thread).

The workload intensity of an e-commerce site is not totally predictable since 
external events can generate load spikes that exceed by far the average load. 
Therefore, e-commerce site managers need to be able to monitor the performance
of the site, in order to understand its bottlenecks and to tune the system
to cope with varying traffic patterns. For this purpose, adequate monitoring
tools are needed. This paper presents PROFIT, a new tool for profiling the
performance of e-commerce sites. PROFIT has the following characteristics. It
measures both throughput and response time and breaks down the response 
time in terms of components (e.g., web server, application server, and database 
server) and services (e.g., search, browse, select, add to cart, and pay). 

The rest of the paper describes the tool in detail and shows examples of its
use. Section 2 discusses the typical architecture of complex e-commerce sites. Sec-
tion 3 presents the approach used to measure the performance of e-commerce
servers. This section describes the architecture of the tool, the measurement ap-
proach, and the metrics obtained, and shows a screenshot of PROFIT's GUI.
Section 4 discusses two examples of the use of the tool. One example shows the
overhead costs introduced by the use of secure connections through the Secure
Sockets Layer (SSL) protocol [^. The other example compares the processing
cost of the e-commerce application when dealing with US customers versus in-
ternational customers. Finally, Section 5 presents the concluding remarks and
discusses directions for future work.

2 Architecture of E-commerce Sites 

From the functional viewpoint, e-commerce sites are usually organized into lay- 
ers that perform classes of services. Typical services can be grouped into the 
following categories: 

Presentation: It is the front-end of the electronic store. Standard functional- 
ities available at this level include: satisfying requests for static documents 
and images, redirecting requests to the proper component, and ensuring ac-
cess security. This layer is implemented by Web servers that consist of both
secure and non-secure HTTP servers.

Business Logic: It implements all application-related services. In the case of
a bookstore, for instance, this layer implements services such as browse, 
search, select, add to shopping cart, and payment. This layer also keeps 
some temporary information such as customer sessions. The component that 
implements the business logic is called an application server. 

Database services: It provides persistent and reliable storage for the site's
data. Typically, database servers are used to keep information about items in
the catalog, customer identification, customer profile, inventory, and orders
in progress.

2.1 Implementing E-commerce Services 

The processing flow of a request in an e-commerce site is illustrated in Fig. 1. A
customer request arrives at the site as an HTTP request, which is then handled 
by the Web server. The request may be a request for a static document (e.g., an 
html page or an image) or it may be a request to execute an e-commerce function. 
In the latter case, the Web server invokes the function at the e-commerce server, 
through a CGI-like script or a servlet [2]. The e-commerce server performs the 
requested service and most of the time needs access to data stored in the database 
server. 




Fig. 1. Logical Flow of Requests in an E-commerce Architecture 



While the three components mentioned above may all reside on the same com-
puter, this is not the case in high-volume e-commerce sites. In large sites, separate
computers are dedicated to specific components of the e-commerce architecture.
In many cases, there may be many instances of the same component (e.g., vari-
ous Web servers), each running on a separate machine. Often, separate
LAN segments are used to interconnect all machines that support the same type
of component. Fig. 2 shows a typical architecture used in large e-commerce sites.



Fig. 2. E-commerce Site Architecture



2.2 Performance and Congestion in E-commerce Sites

Two basic traditional performance metrics are used to quantify the performance
of e-commerce sites: throughput and response time. Throughput measures the
rate at which operations are performed. One very broad measure of throughput
used by Web sites is HTTP requests/sec. HTTP requests to e-commerce sites
are associated with requests for execution of e-commerce services. These services
have a large variance in terms of the resource demands placed on the site's in-
frastructure. For example, a request to search for items in the store's catalog
is bound to consume more resources than a request to retrieve the site's home
page. Thus, throughput could be measured in terms of specific e-commerce op-
erations executed per time unit. Examples could be searches/sec, checkouts/sec,
or browse operations/sec.
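
As a minimal sketch of throughput measured per e-commerce operation, the following Python fragment counts completed requests per service over a measurement interval. The record layout (a timestamp in seconds and a service name) and the function name are assumptions made for illustration only; they are not PROFIT's actual log format.

```python
from collections import Counter

def throughput_per_service(records, interval_sec):
    """Return operations per second for each service.

    records: iterable of (timestamp_sec, service_name) tuples (hypothetical layout).
    interval_sec: length of the measurement interval in seconds.
    """
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    # Count how many requests of each service type were observed.
    counts = Counter(service for _, service in records)
    return {service: n / interval_sec for service, n in counts.items()}

# Hypothetical sample: five requests observed during a 10-second interval.
sample = [(0.5, "search"), (1.2, "browse"), (3.0, "search"),
          (7.7, "checkout"), (9.9, "search")]
rates = throughput_per_service(sample, 10.0)
```

With this sample, searches dominate the per-service rates even though the aggregate HTTP request rate would hide that difference.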

Response time measures the time needed to answer a customer request. The
customer response time is a sum of different components, namely the site re-
sponse time, the network latency, and the customer's browser formatting time.
In this paper we focus on measuring the site response time, which is a function of
the site architecture, the server capacity, and the software structure. We divide
the factors that affect the site response time into the following groups:

Computation Time: This is the total time spent executing the requested task 
at the various processors involved in it. 

Waiting Time: Waiting times are due to contention for physical resources and 
contention for software resources. The former results from the fact that a 
service is delayed because it needs to use a hardware resource that is being 
held by another service. Software contention delays stem from tasks that 
have to queue up for software resources, such as database locks, semaphores, 
or threads. 

Communication Time: This group comprises network delays associated with 
communication among the various components of a site architecture. 






A Tool for Measuring the Performance of Complex E-commerce Sites 






2.3 Characteristics of E-commerce Workloads 

A customer interacts with an e-commerce site through a sequence of interrelated 
requests to the site. These requests constitute a session. During a session, a 
customer requests e-commerce services such as browse, search, select, add to 
the shopping cart, and pay. The navigational pattern of a customer or group of 
customers can be characterized by a Customer Behavior Model Graph (CBMG) 
as described in [4,6]. A CBMG is a graph in which nodes correspond to e-
commerce services and arcs correspond to transitions from one service to the
next in the session. Arcs are labeled by a pair of the type (p, z) where p is the
probability that the transition occurs and z is the average think time. In order
to characterize and model an e-commerce workload, one has to: 

- identify the different types of sessions that compose the workload and gen-
erate one CBMG per session,

- calculate the workload intensity parameters, such as session arrival rates,
and

- determine the resource usage parameters for each type of service request
(e.g., search, browse, select, pay, etc).

A monitoring tool for an e-commerce site should aim at obtaining the data
needed to calculate the parameters that characterize the site's workload.
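
The CBMG described above can be sketched as a nested dictionary in which each arc carries its (p, z) label. The graph below, including its service names, probabilities, think times, and the explicit "exit" state, is a hypothetical example and is not taken from the paper; the random walk only illustrates how a session can be derived from such a graph.

```python
import random

# Hypothetical CBMG: CBMG[a][b] = (p, z), where p is the probability of moving
# from service a to service b and z is the average think time in seconds.
CBMG = {
    "home":   {"browse": (0.5, 4.0), "search": (0.4, 3.0), "exit": (0.1, 0.0)},
    "browse": {"browse": (0.3, 5.0), "search": (0.2, 3.0),
               "add": (0.2, 6.0), "exit": (0.3, 0.0)},
    "search": {"browse": (0.3, 4.0), "search": (0.3, 3.0),
               "add": (0.2, 6.0), "exit": (0.2, 0.0)},
    "add":    {"browse": (0.3, 4.0), "pay": (0.4, 8.0), "exit": (0.3, 0.0)},
    "pay":    {"exit": (1.0, 0.0)},
}

def check_cbmg(cbmg):
    """Verify that the outgoing probabilities of every node sum to 1."""
    for node, arcs in cbmg.items():
        total = sum(p for p, _ in arcs.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"probabilities out of {node!r} sum to {total}")

def simulate_session(cbmg, start="home", rng=None, max_requests=1000):
    """Random walk over the CBMG; returns the list of services requested."""
    rng = rng or random.Random()
    session, state = [], start
    while state != "exit" and len(session) < max_requests:
        session.append(state)
        targets = list(cbmg[state])
        weights = [cbmg[state][t][0] for t in targets]
        # Pick the next service according to the transition probabilities.
        state = rng.choices(targets, weights=weights, k=1)[0]
    return session

check_cbmg(CBMG)
session = simulate_session(CBMG, rng=random.Random(42))
```

The think times z are stored on the arcs but are not used by this walk; a workload generator could sleep for z seconds before issuing the next request.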

3 Measuring the Performance of E-commerce Servers

PROFIT, the tool described here, monitors the behavior of e-commerce servers 
to provide data that support capacity planning and performance tuning activi- 
ties m- By capturing server-side response time, broken down into components 
and services, PROFIT provides information to explain the sources of per-
formance degradation, such as contention, communication bottlenecks, and exe-
cution overhead. The following subsections explain the measurement approach,
the architecture of the tool, and its interface. 



3.1 Measurement Approach 

Basically, there are four factors that influence the measurement approach used
in the design of a tool to monitor the performance of an e-commerce site. 

Heterogeneity: e-commerce sites are composed of different components, such
as HTTP servers, application servers, and database servers. The performance
behavior of each component depends on its capacity and the characteristics
of the workload viewed by the component. For example, the Apache server
keeps a pool of httpd processes running to answer connections. The number
of processes in the pool may increase or decrease according to the request
arrival rate.

Multiprogramming: a single host may satisfy several requests simultaneously.
Multiprogramming is implemented through the allocation of one process per 
request (e.g., event-driven processes or thread-based processes). Multipro- 
gramming is one of the main reasons for software and hardware contention. 
For example, devices such as the CPUs and network interfaces are usually 
shared by many processes. 

Workload diversity: a customer session is composed of several requests that
differ significantly in terms of resource usage. For instance, the cost of a
search is a function of the query complexity and its resource usage may vary
widely from an operation such as add-to-cart.

Monitoring cost: a measurement tool should not be intrusive, in the sense that 
it does not disrupt the performance of the component being monitored [S]. 

Each component was instrumented at the source code level to collect mea- 
surements concerning process execution. Information about the service being 
executed is obtained from the request and is recorded during its execution at 
each component. Basically, PROFIT collects two measurements for each code
segment: processing costs and elapsed time. The processing costs quantify the
processor time spent in executing the code segment, while elapsed time provides 
the wall clock time for that same execution. The performance measurements of 
an e-commerce site are organized into the following categories: 

- Component. It identifies in which component the measurement was collected.
It helps to locate performance problems. For instance, if the system adminis-
trator notices that all DBMS operations are quite slow, he/she may upgrade
the DBMS machine.

- Service. It identifies the nature of the request (e.g., search, browse or login)
being measured. Knowing which services are more expensive may guide im-
plementation enhancements. Notice that services are a function of the nature
of the business application implemented by the server.

- Phases. The execution of a service usually comprises several phases per-
formed in each component. For example, the processing time of an HTTP
request by a Web server could be broken into three phases. The parsing
phase begins right after the establishment of the connection and ends when
the header of the request has been parsed and is ready to be processed. The
processing phase covers the time actually spent processing the request. The
logging phase corresponds to the time spent performing standard HTTP
logging. After logging, a process is ready to process a new request [1].

3.2 Architecture of the Tool 

In this section we describe the architecture of PROFIT. This tool allows site ad- 
ministrators to measure, analyze, and visualize the performance of e-commerce 
sites. Performance analysis with PROFIT usually comprises three steps: (1) mon-
itoring, (2) analysis, and (3) visualization. Next we describe the implementation
of the first two steps. The visualization is described in Section 3.3.

In order to demonstrate the use of PROFIT, we implemented a prototype
that analyzes the performance of electronic stores composed of three software
components:

Apache: Apache is a very popular Web server that supports several protocols 
and secure access. 

Minivend: Minivend is a Perl-based commerce server. It supports several stores 
and provides several facilities such as tag-based templates for specifying store 
contents. 

MySQL: MySQL is an open source database management system that acts as 
the database server. 

The stores implemented by using Minivend usually provide seven distinct 
services: 

- Home: The entry page to the store.

- Browse: The customer selects and looks at pre-selected items.

- Search: The customer searches items using a keyword-driven query.

- Add: The customer adds a product to his/her basket for later purchase.

- Login: The customer registers in the store for purchasing an item.

- Checkout: The customer is ready to order, and provides payment and deliv-
ery information.

- Place order: The customer has chosen the items he/she wants and is ready
to complete the purchase.

As mentioned before, the instrumented code collects three types of measure- 
ments associated with each component, service, and operation: user time, system 
time, and elapsed time. 

Web Monitor. The Web Monitor collects and analyzes data from the Apache 
server. We distinguish three operations that are performed by an Apache server 
while working on a request [1] : 

Parse request: Receives and parses the client request. 

Access check: Verifies whether the client or origin server can issue requests to
the store.

Handle request: Performs the operations specified by the request, which may
be reading a static object from a local file, or dispatching a request to the
transaction server.



Commerce Server Monitor. We distinguish four operations that are usually 
performed by the Minivend component for answering a customer request: 

Get request: Receive a request from the Web server and parse it, determining 
the service to be performed and its parameters. 

Process request: Process the request, managing the user session, and perform- 
ing the computations necessary to accomplish the requested service. 

Access database: Request data from the database server. 

Issue response: Build the response page and send it to the Web server.



DBMS Monitor. The DBMS monitor provides information about the various
operations usually performed by database systems:

Insert: Insert data into a table. 

Update: Update data in a table.

Connect: Connect to a database in order to perform queries to it. 

Quit: Close the connection to the database. 

Select: Select data from a table. 



Implementation Details. Performance data are collected on a per request 
basis. Thus, for each request submitted, each instrumented component gener- 
ates a log record containing the costs associated with the processing categories 
being measured in that component. Each request is assigned a unique identifier
which is used to coalesce the measurements from the various components. The
results are integrated by an analysis tool that matches the records from different
logs, adjusts the measurements that are related to other components (e.g., the 
time that the DBMS spends on a request is subtracted from the time that the 
commerce server waits), and generates the data to be displayed. 
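
The matching and adjustment step can be sketched as follows. This is a minimal illustration under stated assumptions: the log layout (request identifier, elapsed seconds) and the function name are hypothetical, and subtracting the commerce server time from the Web server time is an assumption that mirrors the DBMS adjustment described in the text.

```python
from collections import defaultdict

def merge_records(web_log, commerce_log, dbms_log):
    """Coalesce per-component records by request identifier.

    Each log is a list of (request_id, elapsed_sec) tuples (hypothetical layout).
    Time spent in a nested component is subtracted from its caller so that
    each component is charged only for its own time.
    """
    totals = defaultdict(lambda: {"web": 0.0, "commerce": 0.0, "dbms": 0.0})
    logs = (("web", web_log), ("commerce", commerce_log), ("dbms", dbms_log))
    for name, log in logs:
        for request_id, elapsed in log:
            totals[request_id][name] += elapsed

    profile = {}
    for request_id, t in totals.items():
        profile[request_id] = {
            # Assumption: the Web server's time includes the commerce server call.
            "web": t["web"] - t["commerce"],
            # From the text: DBMS time is subtracted from the commerce server's wait.
            "commerce": t["commerce"] - t["dbms"],
            "dbms": t["dbms"],
        }
    return profile

# Request 1 touches all three components; request 2 is a static document.
profile = merge_records(
    web_log=[(1, 1.0), (2, 0.2)],
    commerce_log=[(1, 0.7)],
    dbms_log=[(1, 0.2), (1, 0.1)],
)
```

Keying every record by the request identifier is what allows records written independently by each component to be joined after the fact.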

The monitoring libraries that collect data for PROFIT were implemented
using standard Unix system calls (gettimeofday and getrusage). These functions
may cause too much overhead if not used properly. Our careful use of these
functions resulted in a low monitoring overhead. For instance, in the experiments
presented in Section 4 we saw an overhead of up to 3% as a consequence of
instrumentation. Our experience is that monitoring is quite scalable, since it
accounts for a small portion of the overall request cost and the amount of data 
collected is not input dependent, i.e., the number of measurements performed is 
constant for each type of operation. 
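
To illustrate the kind of instrumentation described here, the sketch below records user, system, and elapsed time around a code segment. It uses Python's Unix-only resource module, which wraps getrusage, and time.time() as a stand-in for gettimeofday. The measure helper and the record fields are assumptions for illustration; PROFIT's own libraries are instrumented at the source code level of each component and are not shown in the paper.

```python
import resource
import time
from contextlib import contextmanager

@contextmanager
def measure(log, component, service, phase):
    """Append user, system, and elapsed time of the enclosed block to log."""
    usage0 = resource.getrusage(resource.RUSAGE_SELF)
    wall0 = time.time()  # wall clock time, comparable to gettimeofday
    try:
        yield
    finally:
        wall1 = time.time()
        usage1 = resource.getrusage(resource.RUSAGE_SELF)
        log.append({
            "component": component, "service": service, "phase": phase,
            "user": usage1.ru_utime - usage0.ru_utime,    # CPU time in user mode
            "system": usage1.ru_stime - usage0.ru_stime,  # CPU time in kernel mode
            "elapsed": wall1 - wall0,                     # wall clock time
        })

log = []
with measure(log, "commerce", "search", "process request"):
    sum(i * i for i in range(100_000))  # stand-in for real work
```

Each measured segment costs exactly two getrusage calls and two clock reads, so the number of measurements depends only on the type of operation and not on its input.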



3.3 The Interface of PROFIT 

In the first version of PROFIT, users visualize the various profiles through a
Tk-based graphical interface. The main window of PROFIT provides a fast fo-
cusing mechanism using "pies" and colors, as follows (see Figure 3). The pies
are organized as a table, where each column is associated with a service and
each row presents data for an architectural component. We can also observe
that each row and column header contains the name of the service or compo-
nent and the respective cumulative time (Figure 3, B and C). The radii of the
pies are proportional to the time spent at the associated service and component.
Each slice in the pie represents the amount of time associated with a category,
which may be easily identified through the reference bar above the pies (Figure 3,
A). By clicking on a pie we obtain detailed information about the operations
performed by each component and service.
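
The data behind such a table of pies can be sketched as a simple aggregation. The record layout (component, service, category, seconds), the function name, and the max_radius parameter are hypothetical; the sketch only shows how cumulative row and column times, pie radii, and slice fractions could be derived, not how PROFIT's Tk interface is implemented.

```python
from collections import defaultdict

def build_pie_table(records, max_radius=40.0):
    """Aggregate (component, service, category, seconds) records for display."""
    cells = defaultdict(lambda: defaultdict(float))  # (component, service) -> category -> time
    row_total = defaultdict(float)                   # cumulative time per component
    col_total = defaultdict(float)                   # cumulative time per service
    for component, service, category, seconds in records:
        cells[(component, service)][category] += seconds
        row_total[component] += seconds
        col_total[service] += seconds

    cell_time = {key: sum(cats.values()) for key, cats in cells.items()}
    largest = max(cell_time.values(), default=0.0)
    pies = {}
    for key, cats in cells.items():
        total = cell_time[key]
        pies[key] = {
            # The radius is proportional to the time spent in this cell.
            "radius": max_radius * total / largest if largest else 0.0,
            # Each slice is the fraction of the cell's time in one category.
            "slices": {c: t / total for c, t in cats.items()} if total else {},
        }
    return pies, dict(row_total), dict(col_total)

records = [
    ("Minivend", "checkout", "process request", 3.0),
    ("Minivend", "checkout", "access database", 1.0),
    ("DBMS", "checkout", "select", 2.0),
    ("Minivend", "search", "process request", 1.0),
]
pies, rows, cols = build_pie_table(records)
```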

By analyzing this display, users may easily identify the sources of client-
perceived latency in terms of component, service, and process activity. For ex-
ample, in Figure 3 we can observe that the service checkout presents the highest
Process cost, both at the Minivend and DBMS levels.



Fig. 3. PROFIT Interface



4 Case Studies 

In this section we show the use of PROFIT for evaluating and analyzing the
performance of e-commerce sites.



4.1 Experimental Setup 

We implemented two experimental e-commerce sites for illustrating the use and
applicability of our approach. The first site is based on Minivend and implements
a demo art store that comes with that server. The second site is a C-based im-
plementation of a bookstore server that we built, which is completely compatible
with Minivend. As discussed in the sections that follow, the use of two different
implementations not only demonstrates the generality of PROFIT, but also al-
lowed us to compare the performance provided by the different implementation
approaches. For the sake of simplicity, we will refer to these sites as "art store"
and "bookstore" throughout our discussion.

Both sites are organized in three levels: Web server, commerce server, and
database server. The Web server is an Apache 1.3.9 server running on a 200 MHz
dual PentiumPro. The commerce server of the art store executes a Minivend cat-
alog 3.14 on a 200 MHz quad PentiumPro. The commerce server of the bookstore,
as mentioned, is a C-based implementation that runs on the same machine. The
database server runs MySQL 3.22.22 on a 200 MHz Pentium server. The client
is a 200 MHz Pentium PC with 64 MB RAM. All machines are on the same LAN
and execute Linux 2.2.5.

4.2 Workload Generation

The experiments presented are based on two workloads, one for each store.
We generated both workloads for the experiments using a modified version of
Surge that handles sessions, i.e., sequences of requests issued by a customer.
For the art store experiments, each session comprises at most 34 requests, includ-
ing 14 images. The types of requests and the CBMG for the session are depicted
in Figure 4. The bookstore experiments are based on a real bookstore trace com-
prising 123,664 requests distributed over 3,177 user sessions (38.92 requests per
session on average). We also downloaded the product database of the same store,
comprising 4,309 books.







During each experiment we usually started five processes, and each process
emulates two clients (i.e., each process starts two threads). Each experiment
lasted for 30 minutes. The results presented in the sections that follow show
the averages obtained in three experiments. The variance of the measurements
among all experiments is within 3%. Due to space limitations, we summarize
our results in tables and graphs. In reality, these results are displayed in various 
PROFIT screens. 

4.3 Example 1: Overhead of Secure Connections 

In this section we use PROFIT to quantify the overhead of SSL transactions by
comparing the profiles of requests executing on secure and non-secure servers.

Table 1. ArtStore: Secure and Non-Secure Services for US customers

          Secure                              Non-Secure
          Average     Throughput  Size of     Average     Throughput  Size of
          Response    (ops/min)   Response    Response    (ops/min)   Response
Service   Time (sec)              (Bytes)     Time (sec)              (Bytes)
Home      5.49        6.7         283         5.55        7.0         270
Index     7.68        6.7         7366        7.19        7.0         6263
Browse    7.38        13.1        9542        6.54        13.8        3531
Search    6.48        19.4        9495        6.36        20.5        8969
Add       7.26        12.9        6224        6.87        13.7        5861
Checkout  7.52        9.1         11010       6.93        10.1        3748
Login     6.10        4.7         4667        6.00        4.7         4405
Static    1.03        91.9        4366        0.98        97.1        4292

In order to quantify this overhead we performed a set of experiments employing 
both secure and non-secure connections. We performed experiments for both the 
ArtStore and for the BookStore. 

The profiles for the ArtStore are shown in Table 1. We notice that the use
of SSL connections increased the server response time (i.e., weighted average
response time over all business services) from 5.81 to 6.10 seconds. Furthermore,
the throughput of the non-secure version was 10% greater (for checkout) than
that of the secure server.

By verifying the profiles of each component, we can see that the use of secure
connections affected both Apache and Minivend. Apache has to decrypt and en-
crypt the messages exchanged with the client, while Minivend has to parse the
tags inherent to secure connections and to generate secure response pages. For
instance, the cost of Browse services in Apache is 5% higher when using secure
connections. The overhead for secure connections in the Checkout service, how-
ever, is even higher, since its parameters (i.e., customer data) account for almost
1 Kbyte. In this case, the processing cost, as measured by PROFIT, increased
from 6.93 to 7.52 seconds per request for secure connections, an increase of 8.5%.
The costs for Minivend operations also show the same trend, although almost
all the differences are within 6%. The processing cost, which includes parameter
parsing, increased by 11.8%. Finally, the costs of database operations do not vary
significantly with secure connections, and we observe that the lower throughput
in higher levels resulted in a less overloaded database in secure servers, which,
for instance, opened connections 11% faster (6976 microseconds in the secure
server compared to 7982 microseconds in the non-secure server).
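
The 8.5% figure quoted above for Checkout can be checked with a short calculation. The helper name percent_increase is an assumption introduced for this sketch; the two input values are the Checkout response times from Table 1.

```python
def percent_increase(before, after):
    """Relative increase of after over before, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0

# Checkout response time from Table 1: non-secure 6.93 s, secure 7.52 s.
checkout_overhead = percent_increase(6.93, 7.52)
print(f"{checkout_overhead:.1f}%")  # prints 8.5%
```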

The results for the Bookstore are similar, as shown in Table 3, where we can
observe that the use of secure connections increased the client response time from
0.1 to 0.6 seconds on average. This overhead is especially significant for lower cost
requests such as Home and Static. Checking the Apache measurements provided
by PROFIT, we verify that the costs of receiving and sending encrypted data
are about 40% higher than the costs for handling raw data. Thus, requests that
transmit more data (e.g., search) are more affected by secure connections. The
remaining levels of our Bookstore were not affected by the secure connections,
since the additional data is not even parsed by the commerce server and the
queries to the database do not change at all.



4.4 Example 2: US vs. International Customers 

Early experiments with Minivend have shown that service response times for
international customers, mainly for Checkout, are higher than for the same
service for US customers. In order to identify the source of the performance
degradation, we performed an experiment where all customers were international
and compared its PROFIT profiles to another experiment where all customers were
American (Table 2).

We can observe that the Checkout time, for instance, increased from 6.93 to
47.7 seconds, a 588% increase (6.88 times the US value). By examining the
performance profiles of each component, we observe that the highest increase in
response time, as expected, is associated with the Checkout service, explained
by a very large number of select operations per service execution (607).
However, the response times of all services handled by the commerce server
increase when serving only international customers. The Minivend profiles given
by PROFIT show that the commerce server waits three orders of magnitude longer
for database operations, explaining the increase in all services. Checking the
Minivend source code, we learned that checkout services performed by
international customers involve several select operations to retrieve
information about the various countries. As a result, the high cost associated
with checkout operations also affects other operations performed by the
Minivend server, which also have to wait to perform database operations.



Table 2. ArtStore: Services for international and US customers

           |      International Customers       |           US Customers
           | Average                  Size of   | Average                  Size of
           | Response    Throughput   Response  | Response    Throughput   Response
Service    | Time (sec)  (ops/min)    (Bytes)   | Time (sec)  (ops/min)    (Bytes)
-----------+------------------------------------+-----------------------------------
Home       | 36.60       0.9          270       | 5.55        7.0          270
Index      | 46.74       0.9          17303     | 7.19        7.0          6263
Browse     | 22.26       1.7          8629      | 6.54        13.8         3531
Search     | 10.48       2.2          16154     | 6.36        20.5         8969
Add        | 12.06       1.5          16332     | 6.87        13.7         5861
Checkout   | 47.69       0.8          7356      | 6.93        10.1         3748
Login      | 7.67        0.3          10815     | 6.00        4.7          4405
Static     | 0.99        11.4         5155      | 0.98        97.1         4292

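
The slowdown per service can be recomputed directly from the response times in
Table 2. The short Python sketch below is only an illustration of that
arithmetic; the function name `slowdown` is ours and not part of PROFIT. It
reports both the ratio and the percentage increase, which differ by 100
percentage points and are easy to confuse.

```python
# Average response times in seconds, copied from Table 2:
# service -> (international customers, US customers).
RESPONSE_TIME = {
    "Home": (36.60, 5.55),
    "Index": (46.74, 7.19),
    "Browse": (22.26, 6.54),
    "Search": (10.48, 6.36),
    "Add": (12.06, 6.87),
    "Checkout": (47.69, 6.93),
    "Login": (7.67, 6.00),
    "Static": (0.99, 0.98),
}


def slowdown(international, us):
    """Return (ratio, percentage increase) of the international time over the US time."""
    ratio = international / us
    return ratio, (ratio - 1.0) * 100.0


for service, (intl, us) in RESPONSE_TIME.items():
    ratio, percent = slowdown(intl, us)
    print(f"{service:<9} ratio {ratio:5.2f}   increase {percent:6.1f}%")
```

Static requests, which bypass the commerce server, are the only ones whose
ratio stays close to 1.
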
4.5 Example 3: Comparing Apache Modules to CGI

Our third example uses PROFIT to quantify the gains associated with the use of
Apache modules instead of the standard CGI approach. The Common Gateway
Interface (CGI) was intended to be a glue that could easily bridge between the
web protocols and other forms of information technology. The main issue with
CGI is that each time the web server needs to run a CGI script, it must set up
the CGI environment and start a new process to run the script. The Apache web
server provides an interface that allows the development of modules that can
be dynamically loaded at startup and run as part of the main server code. It is
possible, for example, to convert the code for a CGI script into an Apache
module, eliminating the most expensive part of CGI processing, which is the
creation and management of a new process. Thus, we implemented an Apache module
for our Bookstore and compared its performance to the standard implementation
using CGI. The client response times can be seen in Table 3 and show that the
use of modules did not improve the performance of our bookstore significantly
(up to 7%).



Table 3. BookStore: Client response times

           |   Non-Secure Module     |     Non-Secure CGI      |      Secure Module
           | Average                 | Average                 | Average
           | Response    Throughput  | Response    Throughput  | Response    Throughput
Service    | Time (sec)  (ops/min)   | Time (sec)  (ops/min)   | Time (sec)  (ops/min)
-----------+-------------------------+-------------------------+------------------------
Home       | 0.58        88.6        | 0.61        88.7        | 0.73        88.3
Browse     | 1.61        5.1         | 1.64        5.1         | 1.82        4.9
Search     | 12.66       39.1        | 12.50       39.2        | 12.08       38.6
Add        | 4.51        2.9         | 4.56        2.9         | 4.61        2.7
Static     | 0.54        186.5       | 0.57        187.5       | 0.66        169.0
Select     | 5.92        16.5        | 5.86        16.4        | 5.59        16.7


4.6 Example 4: Evaluating Store Scalability 

Our last example uses PROFIT for evaluating the scalability of our Bookstore.
In order to find and quantify possible bottlenecks, we varied the number of
clients that submit requests simultaneously to the store from 1 to 12. We
observed that the number of clients did not significantly affect the response
time for Home and Static requests. Among the remaining services, Search was the
most affected by an increase in the number of clients; its response time grew
from 1.97 to 14 seconds. We further investigated the sources of performance
degradation by analyzing the category breakdown for search operations as
measured by PROFIT. The bar graph of Figure 5 shows the cost distribution for
the various configurations. We can see clearly that the increase is mostly
explained by the costs associated with queries to the DBMS, which does not
scale and demands reevaluation.

Fig. 5. Bookstore: Search Profile

5 Conclusions and Future Work 

This paper describes the design and implementation of PROFIT, a new tool for
profiling the performance of e-commerce sites. PROFIT measures both throughput
and response time. It also breaks down the response time in terms of
components (e.g., Web server, application server, and database server) and services
(e.g., search, browse, select, add to cart, and pay). We have shown four examples 
of the use of PROFIT, where the tool helped us to identify performance prob- 
lems. We are currently extending the tool to obtain a breakdown of kernel time. 
This will allow us to quantify the communication and disk costs, as discussed 
in [Tj. We are extending PROFIT to also monitor kernel activity, and correlate 
this activity with client requests. 
1 Introduction

The tool FiFiQueues analyses a general class of open queueing networks, con- 
sisting of queueing stations with limited or unlimited buffer capacity and with 
arbitrary connections between them. The external arrival processes and the ser- 
vice processes are not limited to Poisson processes but are defined by the first 
and the second moments of the underlying general phase-type distributions [5].
In fact, the class of queueing networks that can be analysed with FiFiQueues 
supersedes the model class supported by Whitt’s Queueing Network Analyser 
with one server per node. The first extension of Whitt’s QNA that sup- 
ported queueing stations with limited buffer capacity was “QNAUT” [4] which 
was developed at the University of Twente. FiFiQueues replaces some of the 
approximations made in “QNAUT” and employs new and faster algorithms. 

2 Tool Description 

2.1 Model Specification 

A model specification defines the following queueing network model parameters: 

- The number n of nodes (queueing stations) in the network.

- The characteristics of external arrivals to node i (for i = 1, ..., n), given
  by the arrival rate $\lambda_{0,i}$ and the squared coefficient of variation
  $c_{0,i}^2$ of the interarrival time.

- The service time distribution of node i, given by the service rate $\mu_i$
  and the squared coefficient of variation $c_{s,i}^2$ of the service time.

- The queueing capacity of node i (if limited).

- A flag indicating whether the traffic stream leaving node i consists of the
  served customers (default behaviour) or of the customers that have not
  entered the node due to a full queue.

The connections between the nodes are described by the routing matrix R, where
element $r_{ij}$ gives the routing probability for the connection from node i
to node j. Currently two user interfaces are supported, a textual one and a
graphical one (Figure 1).
Fig. 1. Graphical model specification



2.2 Evaluation Results 

FiFiQueues evaluates the following performance characteristics: 

Traffic flow results: For each node i, the characteristics of the traffic
entering the node ($\lambda_{a,i}$ and $c_{a,i}^2$) and leaving the node
($\lambda_{d,i}$ and $c_{d,i}^2$) are computed. Additionally, FiFiQueues
computes the offered load to node i, the real (accepted) load, the customer
loss probability $b_i$ and the first two moments of the interloss time
distribution.

Node performance results: For each node i, the expected queue length $E[N_i]$,
the expected waiting time $E[W_i]$ and their variances, $\mathrm{Var}[N_i]$ and
$\mathrm{Var}[W_i]$, are computed.

Network wide results: This includes the expected number of customers in
the network $E[N]$, the expected sojourn time of an aggregate customer in
the network $E[T]$ and their variances $\mathrm{Var}[N]$ and $\mathrm{Var}[T]$.

3 Employed Algorithms 

In this section we give a very short introduction to the operation of FiFiQueues.
A more sophisticated description is available in |^. 

FiFiQueues computes the steady-state performance; it thereby assumes that 
there are no dependencies between the queueing stations except for the network 
connections. This means that we only need to know the input traffic character- 
istics of a node to analyse it. In the following two sections, we explain how the 
inter-node traffic characteristics are computed and how the nodes are analysed. 

3.1 Traffic Analysis 

The queueing and service behaviour at a node influences the characteristics of 
its departure process, and hence the characteristics of the arrival processes at 
successive nodes. Since we allow network configurations with feedback there are
mutual dependencies between the nodes: the input of a node may indirectly de- 
pend on its output. We apply a fixed-point iteration technique to resolve these 
mutual dependencies. At the start of the computation, FiFiQueues only knows 
the traffic characteristics of the external arrivals. Based on this information it 
analyses the nodes (see next section) and obtains a first approximation for the 
departure streams of the nodes. This intermediate result is used as the starting 
point in the next “round” of the analysis. The iteration stops when the differ- 
ence between two consecutive intermediate results is smaller than the desired 
precision. 
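
The structure of such a fixed-point iteration can be sketched in a few lines of
Python. This is a simplified illustration and not the FiFiQueues algorithm: it
tracks only the arrival rates (not the second moments), and it uses the
textbook M/M/1/K loss formula as a stand-in for the tool's node analysis. The
names `mm1k_loss` and `solve_traffic` and all numerical values are assumptions
made for the example.

```python
def mm1k_loss(lam, mu, capacity):
    """Loss probability of an M/M/1/K node; capacity=None means an unlimited buffer."""
    if capacity is None or lam == 0.0:
        return 0.0
    rho = lam / mu
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (capacity + 1)
    return (1.0 - rho) * rho ** capacity / (1.0 - rho ** (capacity + 1))


def solve_traffic(lam0, mu, capacity, R, tol=1e-9, max_iter=1000):
    """Iterate lam_j = lam0_j + sum_i departures_i * R[i][j] until it stabilises.

    Returns the total arrival rate per node and the number of iterations used.
    """
    n = len(lam0)
    lam = list(lam0)  # start: only the external arrivals are known
    for iteration in range(1, max_iter + 1):
        # "Node analysis": rate of served customers leaving each node.
        dep = [lam[i] * (1.0 - mm1k_loss(lam[i], mu[i], capacity[i]))
               for i in range(n)]
        # Route the departures to obtain the next approximation of the arrivals.
        new = [lam0[j] + sum(dep[i] * R[i][j] for i in range(n))
               for j in range(n)]
        diff = max(abs(a - b) for a, b in zip(new, lam))
        lam = new
        if diff < tol:
            return lam, iteration
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical 3-node tandem network; half of the output of node 3 is fed back to node 1.
ROUTING = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [0.5, 0.0, 0.0]]
rates, rounds = solve_traffic([1.0, 0.0, 0.0], [5.0, 5.0, 5.0], [None] * 3, ROUTING)
print(rates, rounds)
```

With unlimited buffers the loop reproduces the solution of the linear traffic
equations; with finite buffers the loss at each node depends on its own input,
which is why an iterative scheme is needed.
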

Figure 2 shows the incoming traffic to the first node of a 3-node network
with feedback (shown in Figure CJ as a function of the number of iterations in 
the fixed-point procedure. As can be observed, the fixed-point is reached after 
a very small number of iterations. This fast convergence has been typical for all 
queueing networks we have analysed so far. 




Fig. 2. Incoming traffic to a node as a function of the number of iterations in the 
fixed-point procedure 



3.2 Node Analysis 

When FiFiQueues analyses a queueing node it represents the arrival and the
service process by phase-type distributions [5]. It has some freedom to select
an appropriate PH-distribution since the user defines only the first and the
second moment. To keep the required computation as efficient as possible,
FiFiQueues selects the PH-distribution with the smallest number of phases that
matches the first two moments.
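
As an illustration of two-moment matching with few phases, the Python sketch
below uses a standard textbook fit: an exponential for a squared coefficient
of variation (SCV) of 1, a two-phase hyperexponential with balanced means for
SCV > 1, and a mixture of Erlang(k-1) and Erlang(k) with k = ceil(1/SCV) for
SCV < 1. Whether FiFiQueues uses exactly these distributions is not stated
here, so the fit and the names `fit_two_moments` and `moments` should be read
as assumptions.

```python
import math


def fit_two_moments(mean, scv):
    """Fit a phase-type distribution to a mean and a squared coefficient of variation.

    Returns (number_of_phases, branches), where each branch is a tuple
    (probability, number_of_exponential_stages, rate_per_stage).
    """
    if mean <= 0.0 or scv <= 0.0:
        raise ValueError("mean and scv must be positive")
    if abs(scv - 1.0) < 1e-12:
        return 1, [(1.0, 1, 1.0 / mean)]  # plain exponential
    if scv > 1.0:
        # Two-phase hyperexponential with balanced means.
        p1 = 0.5 * (1.0 + math.sqrt((scv - 1.0) / (scv + 1.0)))
        return 2, [(p1, 1, 2.0 * p1 / mean),
                   (1.0 - p1, 1, 2.0 * (1.0 - p1) / mean)]
    # scv < 1: Erlang(k-1) with probability p, Erlang(k) otherwise, common rate.
    k = math.ceil(1.0 / scv - 1e-12)
    p = (k * scv - math.sqrt(max(0.0, k * (1.0 + scv) - k * k * scv))) / (1.0 + scv)
    rate = (k - p) / mean
    return k, [(p, k - 1, rate), (1.0 - p, k, rate)]


def moments(branches):
    """Return (mean, scv) of a mixture of Erlang branches."""
    m1 = sum(p * n / r for p, n, r in branches)
    m2 = sum(p * n * (n + 1) / (r * r) for p, n, r in branches)
    return m1, m2 / (m1 * m1) - 1.0


phases, branches = fit_two_moments(2.0, 0.3)
print(phases, moments(branches))
```

The number of phases grows as the SCV approaches zero, because a phase-type
distribution with k phases cannot have an SCV below 1/k.
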

To analyse a queueing station we use the PH-distributions for the arrival 
and the service processes to build a CTMC. We then compute the steady-state 
probabilities of the CTMC. Using Bocharov’s results |T] and the steady-state 
probabilities we are able to compute the desired performance measures. In par- 
ticular, based on the arrival- and the service-time characteristics we are able 
to compute the departure process characteristics exactly. The CTMC of infinite
nodes (i.e., nodes with unlimited buffer capacity) is analysed by the spectral
expansion method and the CTMC of finite nodes is analysed by solving its
global balance equations [2]. Note that only per-queue CTMCs are created. In
this way we avoid the state-space explosion problem.
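
For a finite node, solving the global balance equations means solving the
linear system pi Q = 0 together with the normalisation sum(pi) = 1. The Python
sketch below shows this for the simplest possible case, a single M/M/1/K queue
with exponential arrivals and services; the CTMCs built by FiFiQueues from
PH-distributions are larger but are solved on the same principle. The function
names and the parameter values are assumptions made for the example.

```python
def mm1k_generator(lam, mu, capacity):
    """Generator matrix of an M/M/1/K queue with states 0..capacity."""
    n = capacity + 1
    Q = [[0.0] * n for _ in range(n)]
    for s in range(n):
        if s < capacity:
            Q[s][s + 1] = lam      # arrival accepted
        if s > 0:
            Q[s][s - 1] = mu       # service completion
        Q[s][s] = -sum(Q[s])       # diagonal makes the row sum zero
    return Q


def steady_state(Q):
    """Solve pi Q = 0, sum(pi) = 1 by Gaussian elimination with partial pivoting."""
    n = len(Q)
    # One balance equation per row (transpose of Q); the last equation is
    # redundant and is replaced by the normalisation condition.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            if factor != 0.0:
                for c in range(col, n):
                    A[r][c] -= factor * A[col][c]
                b[r] -= factor * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - rest) / A[r][r]
    return pi


pi = steady_state(mm1k_generator(1.0, 2.0, 3))
print("loss probability:", pi[-1])
print("mean number in node:", sum(i * p for i, p in enumerate(pi)))
```

Because each node has its own small chain, the cost grows with the number of
nodes rather than with the product of their state spaces.
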



4 Outlook 

We are continuing our work on FiFiQueues. Our studies have resulted in new 
requests for tool enhancements, e.g., support for route-based networks as needed 
to analyse network switches with cross-traffic. 

Additional information concerning FiFiQueues, including a user manual, is
available via WWW:

http://www-lvs.informatik.rwth-aachen.de/tools/index.html
Abstract. Galileo is a prototype software tool for dependability analysis
of fault tolerant computer-based systems. Reliability models are specified 
using dynamic fault trees, which provide special constructs for modeling 
sequential failure modes in addition to standard combinatorial fault tree 
gates. Independent modules are determined automatically, and separate 
modules are solved combinatorially (using Binary Decision Diagrams) or 
using Markov Methods. 



1 Introduction 

A fault tree model is a graphical representation of logical relationships between 
events (usually failure events). Fault trees were first developed in the 1960’s to 
facilitate analysis of the Minuteman missile system, and have been supported by 
a rich body of research since their inception. The quantitative analysis of a fault 
tree is used to determine the probability of system failure, given the probability 
of occurrence for failure events. 

A fault tree consists of the undesired top event (system or subsystem failure) 
linked to more basic events by logic gates. The top event is resolved into its 
constituent causes, connected by AND, OR and M-out-of-N logic gates, which 
are then further resolved until basic events are identified. Recent advances in 
fault tree technology have adapted the fault tree model to handle specific com- 
plexities associated with today’s computer-based systems. These advances in- 
clude the use of Binary Decision Diagrams for solution of traditional fault tree 
models [3,7,4,8,2], the definition of new fault tree constructs to capture dynamic
behavior |S1, and the incorporation of coverage models for the analysis of fault 
tolerant systems m- These new techniques have allowed the fault tree model, 
long appreciated for its concise and unambiguous representational form, to be 
applicable to the analysis of complex computer-based systems. 
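
For a static fault tree whose basic events are independent and not shared
between branches, the quantitative analysis reduces to simple gate formulas.
The Python sketch below illustrates AND, OR and M-out-of-N gates under exactly
those assumptions; it is not how Galileo works internally (Galileo uses BDDs
and Markov chains, which also handle shared events and dynamic gates), and the
example tree and its probabilities are hypothetical.

```python
from math import prod


def p_and(probs):
    """All independent input events occur."""
    return prod(probs)


def p_or(probs):
    """At least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


def p_m_of_n(m, probs):
    """At least m of the independent input events occur."""
    # dist[j] = probability that exactly j of the inputs seen so far occurred.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)
            nxt[j + 1] += q * p
        dist = nxt
    return sum(dist[m:])


# Hypothetical tree: the system fails if both memories fail,
# or if at least 2 of 3 redundant processors fail.
memory = [1.0e-3, 1.0e-3]
processors = [2.0e-2, 2.0e-2, 2.0e-2]
top = p_or([p_and(memory), p_m_of_n(2, processors)])
print(f"top event probability: {top:.6e}")
```

AND and OR are the two extreme cases of the M-out-of-N gate (M = N and M = 1).
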

The key to the Galileo fault tree analysis methodology is the judicious combi- 
nation of the BDD and dynamic fault tree (Markov chain) approaches, including 
coverage modeling for both approaches, in a single modular solution of a fault 
tree model. Instead of choosing between the two approaches, we automatically 
divide the overall fault tree model into independent subtrees and use the BDD 
approach where we can, and the Markov approach only where it is needed. 
Thus we utilize a combination of complementary approaches in a single solution 
methodology. 
2 A Mission Avionic System Example

We demonstrate the capabilities of the Galileo fault tree analysis tool on an
example mission avionics system, whose block diagram is shown in Figure 1 and
whose modularized fault tree is shown in Figure 2.




Fig. 1. MAS example system architecture 



One processing unit is required for the crew station functions, local path gen- 
eration, and mission and system management. Each of these processing units is 
supplied with a hot spare backup to take over control when the primary proces- 
sor has encountered an error. The scene and obstacle and vehicle management 
subsystems both require more functionality than one processing unit can pro- 
vide. Thus, each uses two processing units. In addition to the hot spare backups, 
two additional pools of spares are provided, each containing two hot spare
processing units. S1 and S2 can be used to cover the first two processor
failures in the subsystems other than the VMS. VMS1 and VMS2 cover the failures in the
VMS subsystem. The subsystems are connected via two triplicated bus systems, 
the first is the data bus, the second is the mission management bus. The VMS 
has an additional triplicated bus, the vehicle management bus. 

The system fails if any one of the subsystems cannot function correctly,
or both the memories fail, or all the buses of any one type fail. The input
parameters and results are shown in Table 1.



3 Obtaining Galileo 

Galileo is available, free of charge, under license for evaluation purposes only
(URL: http://www.cs.virginia.edu/~ftree/). Galileo 2.1 Alpha runs on Microsoft
Windows 95, 98, and NT. It requires either or both of Microsoft Word (95 or
97) and Visio Corporation's Visio Technical (4.1 to 5.0). If Internet Explorer
is available, Galileo will use it to browse the user documentation from within
the tool. Installation involves unzipping the archive and executing a
standard setup program.

Fig. 2. Modularized fault tree for the MAS example system



4 Current Work 

Under contract to NASA Langley Research Center, we are working to expand 
the capabilities of Galileo to allow the analysis of multiple-phased missions, and 
to produce a sensitivity analysis of the results with respect to input param- 
eters. Specifically, this work entails the integration of the modular approach 
with approaches to analyzing multiple phases, using both combinatorial and 
Markov-based techniques, including coverage models. Since sensitivity analysis
for Markov chains can be expensive, especially for large models, we are
investigating approximate methods.

Table 1. Model parameters and results for MAS system

Model Parameters
                           Processor   Memory    Bus
Failure rate (per hour)    2.5e-5      1.0e-6    2.5e-6
Permanent coverage         0.85        0.45      0.99
Transient restoration      0.10        0.50      0.00
Uncovered failure          0.05        0.05      0.01
Mission time = 10 hours
Total unreliability = 2.532e-4

Model Results
Module   Unreliability
a        1.5e-4
b        7.5e-7
c        7.5e-7
d        1.0e-6
e        1.01e-4
f        7.5e-7
g        1.0e-4
1 Introduction

Mobius is a system-level performance and dependability modeling tool. Mobius 
makes validation of large dependability models possible by supporting many dif- 
ferent model solution methods as well as model specification in multiple modeling 
formalisms. 

The motivation for building the Mobius tool was the observation that no 
formalism has shown itself to be the best for building and solving models across 
many different application domains. Similarly, no single solution method is ap- 
propriate for solving all models. Furthermore, new techniques in model specifi- 
cation and solution are often hindered by the necessity of building a complete 
tool every time a novel concept is realized. We deal with these three issues by 
defining a broad framework in which new modeling formalisms and model so- 
lution methods can be easily integrated. In this context, a modeling framework 
is a formal, mathematical specification of model construction and execution. In 
implementing the framework we define an abstract functional interface [1], which
is realized as a set of functions that facilitates intermodel communication as well 
as communication between models and solvers. This abstract functional inter- 
face also allows the modeler to specify different parts of the model in different 
formalisms. 



2 Mobius Framework 

We begin with a brief overview of the concepts of a formalism and a model in 
the Mobius framework. The Mobius framework provides a very general way to 
specify a model in a particular formalism. We define a formalism as a language
for expressing a model within the Mobius framework, frequently using only a
subset of the options available within the framework.

* This material is based upon work supported by DARPA/ITO under Contract No.
DABT63-96-C-0069 and the National Science Foundation's Next Generation
Software Initiative.

We define models within the Mobius framework using a few basic concepts. 
A model is a collection of state variables, actions, groups, and reward variables 
expressed in some formalism. Briefly, state variables hold the state information 
of the model. State variables may be simple integers, as in Petri net places, or 
complex data structures. Actions change the state of the model over time. They 
may have a general delay distribution and a general state-change function, and 
may operate by any one of several execution policies. A group is a collection 
of actions that coordinate behavior in some specific way. Reward variables are 
ways of measuring something of interest about the model. They embed a state 
machine to allow path-based reward variables. 

Although the basic elements of a model are very general and powerful,
formalisms need not make use of all the generality. In fact, it may be useful
to restrict the generality in order to exploit some property for efficiency.
The purpose of some formalisms is to expose these properties easily, and to
take advantage of them for efficient solution. Mobius was designed with this
in mind.

For convenience, it is useful to classify models into certain types. The most 
basic category is that of “atomic models.” An atomic model is a self-contained 
(but not necessarily complete) model that is expressed in a single formalism. 
Several models may be structurally joined together to form a single larger model, 
which is called a composed model. Naturally, a composed model is a model, and 
may itself be a component of a larger composed model. A model that is more 
loosely connected by the sharing of solutions is called a connected model. Next, 
we describe how we implement this framework as a tool. 



3 Mobius Tool 

The first step in implementing the Mobius framework is to define the abstract
functional interface that is at the core of the tool. We have implemented the
functional interface as a set of C++ base classes from which all models must be
derived. In doing so, we define the functional interfaces as pure virtual
methods. This requires that any formalism implementor define the operation of
all the methods in the functional interface. In the same fashion, we construct
C++ base classes for other Mobius framework components, including actions,
groups, state variables, and reward variables. Each of these entities also has
methods that are part of the abstract functional interface.
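
The idea of an abstract functional interface can be illustrated with a small
analogue. The tool itself uses C++ base classes with pure virtual methods; the
sketch below uses Python abstract base classes instead, and every class and
method name in it (`Action`, `Model`, `enabled`, `fire`, `run_until_dead`, and
so on) is hypothetical and not taken from the Mobius source. It shows the point
of the design: a solver written against the interface works for any formalism
that implements it.

```python
from abc import ABC, abstractmethod


class Action(ABC):
    """Hypothetical interface for something that changes the model state."""

    @abstractmethod
    def enabled(self, state: dict) -> bool:
        """Return True if the action may fire in the given state."""

    @abstractmethod
    def fire(self, state: dict) -> None:
        """Apply the state change in place."""


class Model(ABC):
    """Hypothetical interface that every formalism-specific model implements."""

    @abstractmethod
    def initial_state(self) -> dict:
        """Return the initial values of the state variables."""

    @abstractmethod
    def actions(self) -> list:
        """Return the actions of the model."""


class MoveToken(Action):
    """Petri-net-like action: move one token from place src to place dst."""

    def __init__(self, src: str, dst: str):
        self.src, self.dst = src, dst

    def enabled(self, state):
        return state[self.src] > 0

    def fire(self, state):
        state[self.src] -= 1
        state[self.dst] += 1


class TwoPlaceNet(Model):
    """A toy atomic model with two places and one action."""

    def initial_state(self):
        return {"p1": 3, "p2": 0}

    def actions(self):
        return [MoveToken("p1", "p2")]


def run_until_dead(model: Model, max_steps: int = 1000) -> dict:
    """A toy 'solver' that relies only on the abstract interface."""
    state = model.initial_state()
    actions = model.actions()
    for _ in range(max_steps):
        ready = [a for a in actions if a.enabled(state)]
        if not ready:
            break
        ready[0].fire(state)
    return state


print(run_until_dead(TwoPlaceNet()))
```

As with pure virtual methods in C++, a class that leaves one of the abstract
methods undefined cannot be instantiated.
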

The Mobius tool architecture (see Figure 1) is separated into two different
logical layers: model specification and model execution. All model
specification in our tool is done through Java graphical user interfaces, and
all model execution is done exclusively in C++. We decided to implement the
executable models in C++ for performance reasons. Every formalism has a
separate editor for specifying a particular piece of the model. Editors produce
compilable C++ code as output so that the final executable model is specified
entirely within C++. The C++ files produced by the editor are compiled, and the
tool links the object code with formalism libraries and solver-specific
libraries.

After the abstract functional interface was specified, we implemented an 
atomic model formalism, two composed model formalisms, and a reward model 
formalism inside the Mobius framework. This first set of formalisms proves that 
sophisticated modeling formalisms can be integrated into an extensible modeling 
tool. Because of our past work with UltraSAN, we chose to reimplement Ultra- 
SAN’s atomic, composed, and reward model formalisms inside the Mobius tool; 
we also implemented a new composed model editor. 



[Figure: Project Manager; Solver Libraries; Libraries; Executable Model]

Fig. 1. Mobius Architecture.



Currently, our tool contains the following model specification editors: 

SAN Editor: An atomic model editor in which the user can specify models
using the stochastic activity network formalism [3].

Replication-Join Composed Model Editor: This editor allows the user to
specify a composed model by using two composed model constructs: replicate
and join [5].

Graph Composer Editor: This editor allows the user to construct a composed
model through an arbitrary graph of submodels connected through shared
state [2].

Rate-Impulse Reward Editor: This editor allows the user to specify reward
variables whose values are determined by a set of state-based rate and
impulse functions [4].

Study Editors: Through all phases of model specification, global variables can
be used as input parameters. These editors allow the modeler to specify the
values of those global variables.

Discrete Event Simulator: This generic simulator allows any model to be
simulated for transient or steady-state reward measures. It also allows the
simulation to be distributed across a heterogeneous set of workstations,
resulting in a near-linear speed-up.

State-Space Generator: This module creates a Markov process description for
a model that has exponentially distributed delays between state changes. The
output of the state-space generator is used as an input for many different
analytical solvers.

Analytical Solvers: There are several analytical solvers implemented in the
Mobius tool. They include both transient and steady-state solvers.

In the process of developing these first Java interfaces, we constructed several 
Java class packages that facilitate the construction of graphical user interfaces 
for the Mobius tool. Having such utilities should minimize the amount of time 
required to implement a specification module for a new formalism or solution 
technique. 

4 Future Directions 

The next important step in the development of the tool will be to implement 
more formalisms in the Mobius framework to show that it is truly an extensible 
architecture. There are many different atomic model formalisms that could be 
implemented, including queuing networks, GSPNs, reliability block diagrams, 
stochastic process algebras, and fault trees. There are also many opportunities 
to explore connection formalisms and model solution methods that use specific 
knowledge of reward measures to reduce the cost of solution. 

We also plan to store all solver results in a results database. The results 
database will be coupled with a results browser capable of submitting sophis- 
ticated queries. This will allow a modeler to create detailed reports of model 
results. With the results database, a user will be able to look at the results from 
different model versions across multiple solution techniques. A visualization tool 
will also be provided to display model results visually. The default format for 
model documentation and report generation will be HTML. The user will have 
the ability to launch an application from the tool to view the HTML output.
Abstract. MRMSolve is a new analysis tool developed for the evaluation
of large Markov Reward Models (MRM) that provides the moments
of the accumulated reward and the completion time. MRMSolve is based
on Java technology, so the tool can be accessed from any node
connected to the Internet that has a Java-enabled Web browser.



Introduction. Markov reward models have been effectively used for the
performance and performability analysis of computer and communications systems.
The evaluation of the distribution of reward measures (i.e., the accumulated
reward and the completion time) is a computationally hard procedure, hence it
cannot be performed for MRMs with a large state space (10^-10® states). A
recently introduced numerical technique allows the evaluation of the moments of
reward measures of MRMs with a large state space [2]. The MRMSolve tool is
based on this numerical technique.

MRMSolve is composed of an analysis engine and a graphical user interface
(GUI). The analysis engine is a C++ implementation of the numerical method
introduced in [2]. The GUI provides worldwide access to the analysis engine
using Java technology.

In this summary we focus on the structure and the GUI of MRMSolve and 
we do not enter the details of modeling and analysis of real systems with MRMs. 



The Structure of MRMSolve. MRMSolve uses a client-server architecture, as
depicted in Figure 2. The server runs on a powerful remote machine, while the
client program can be downloaded to any Java-enabled computer that is connected
to the Internet. The MRM to be analyzed and the required performance measures
are defined in the GUI. When the model definition is complete and correct, the
model is uploaded to the server machine, which performs the (usually
time-consuming) computation and downloads the result to the client.

* This work was supported by OTKA T-30685 and F-23971. 
[Figure: input screen with the panels Model Descriptions (Rate matrix (Q),
Reward matrix (R), Initial distr. (P0), Model Name), Analysis Parameters
(Moments, Time Points, Precision) and Status, the buttons Open, Get status,
Calculate, Stop and Save, and the built-in editor showing a State Space and
Rate Matrix description]

Fig. 1. The input screen of MRMSolve

The design of MRMSolve takes into account that the client machine is not
necessarily powerful and that the connection between the client and the server
can be slow, depending on the condition of the Internet. Based on these
considerations, the computing effort required at the client side is minimized
and an effective data representation is used to describe and upload MRMs with a
large state space, as discussed below. The data size of the results is usually
negligible.



Graphical User Interface. MRMSolve can be started with appletviewer or
with any Java-enabled web browser from the following address:
http://Indus.hit.bme.hu/~MRMSolve
After starting the Java applet, the input screen of MRMSolve appears
(Figure 1). On this screen the name of the model and the model data (generator
matrix, reward rate vector and initial distribution vector) have to be defined.
There are two ways to define input matrices and vectors. Existing model
description files can be selected and opened using the "Open" button, or new
model descriptions can be created using the built-in editor by clicking on the
"New" button. Any opened model description file can be modified using the same
editor.

The consistency of the opened description file can be checked using the
"Check" button. The required results and the parameters of the analysis, i.e.,
the number of moments, the time (reward) points and the required precision, can
be defined in the lower left block of the input screen. The calculation can be
started by hitting the "Calculate" button and can be stopped with the "Stop"
button. After the calculation is started, the client applet uploads all model
parameters to the server, where a C++ program evaluates the model. The data
communication with the client is implemented in Java on the server side as
well. During the computation the actual status of the evaluation (e.g.,
uploading model, processing rate matrix description, calculating, processing
result file, ...) is indicated on the input screen in the "Status" field. The
text file of the obtained results is shown on the Results screen.
Table 1. The syntax of matrix description

State Space
#Section 1
<var_1> : <initial_value_1> To <end_value_1>
<var_2> : <initial_value_2> To <end_value_2>
#Section 2
<boolean_expr>
Matrix
(var_1, var_2) - (<expr>, <expr>) = <expr>
(<expr>, <expr>) - (var_1, var_2) = <expr>
<boolean_expr> : (var_1, var_2) - (<expr>, <expr>) = <expr>
<boolean_expr> : (<expr>, <expr>) - (var_1, var_2) = <expr>



Table 2. The rule-based description of the example

State Space                       State Space       State Space
a : 0 To 4                        a : 0 To 4        a : 0 To 4
b : 0 To 2                        b : 0 To 2        b : 0 To 2
a + 2 * b <= 4                    a + 2 * b <= 4    a + 2 * b <= 4
Rate Matrix                       Reward Vector     Initial Vector
(a, b) -> (a + 1, b) = 1          (a, b) = a        (a, b) = 1
(a, b) -> (a, b + 1) = 2
(a, b) -> (a - 1, b) = 5 * a
(a, b) -> (a, b - 1) = 4 * b
(a, b) -> (a - 1, b + 1) = 3 * a



Data Representation. The applied client-server architecture requires an effective
description of large MRMs at the client side, because it would not be
practical to upload the generator matrix of an MRM with some hundred thousand
states, which can be several Mbytes large. To reduce the computational requirements
at the client side and the amount of uploaded data, a rule-based matrix (vector)
description is applied. A set of rules is provided at the client side and is
used to build the generator matrix at the server side. This way only the set
of rules is uploaded (~ KByte). This rule-based model description is simple
and effective if the MRM possesses a regular structure. For rather complicated
MRMs, a local installation of the analysis engine might help.

The applied rule-based model description (Table 1) is composed of general
expressions (<expr>) that can contain variables (<var> ∈ Z) defined in #Section 1,
arithmetic operators (+ - * /) and several functions (min(.,.), max(.,.),
sqrt(.), abs(.), power(.,.), log(.), ln(.), ...). A boolean expression can contain
variables, arithmetic operators, several functions and logical operators (AND,
OR, NOT). Comments are denoted by #.
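
To illustrate how a small set of rules can expand into a full generator matrix on the server side, the following is a minimal Python sketch that uses the state space and rate rules of Table 2. It is not the MRMSolve implementation: the names `RANGES`, `RULES`, `in_state_space` and `build_generator` are hypothetical, and the sketch assumes that a rule is simply skipped when its target state lies outside the declared state space or its rate is zero.

```python
from itertools import product

# State-space declaration of Table 2: variable ranges plus one constraint.
RANGES = {"a": range(0, 5), "b": range(0, 3)}


def in_state_space(a, b):
    """True if (a, b) satisfies the ranges and the constraint a + 2*b <= 4."""
    return a in RANGES["a"] and b in RANGES["b"] and a + 2 * b <= 4


# Rate rules of Table 2, each as (target state, rate), both functions of (a, b).
RULES = [
    (lambda a, b: (a + 1, b),     lambda a, b: 1),
    (lambda a, b: (a, b + 1),     lambda a, b: 2),
    (lambda a, b: (a - 1, b),     lambda a, b: 5 * a),
    (lambda a, b: (a, b - 1),     lambda a, b: 4 * b),
    (lambda a, b: (a - 1, b + 1), lambda a, b: 3 * a),
]


def build_generator():
    """Enumerate the states and expand the rules into a dense generator matrix."""
    states = [s for s in product(RANGES["a"], RANGES["b"]) if in_state_space(*s)]
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    Q = [[0.0] * n for _ in range(n)]
    for (a, b), i in index.items():
        for target, rate in RULES:
            t, r = target(a, b), rate(a, b)
            # Assumption: transitions leaving the state space are dropped.
            if r > 0 and in_state_space(*t):
                Q[i][index[t]] += r
        # Diagonal entry makes every row sum to zero.
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return states, Q


states, Q = build_generator()
reward = [a for a, _ in states]            # reward rule (a, b) = a
initial = [1.0 / len(states)] * len(states)  # rule (a, b) = 1, normalized
```

The client only has to transmit the few lines of rules; the matrix, whose size grows with the number of states, is created where it is needed.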

Application Example. Consider a transmission link of capacity C = 4 Mbps,
which is offered calls of two different service classes, each arriving according
to a Poisson process. The calls of the first service class arrive at rate 1, depart
at rate 5 and require 1 Mbps bandwidth. The calls of the second service class arrive at
rate 2, depart at rate 10 and require 2 Mbps. If there is enough free capacity,
the calls of the first service class expand their bandwidth to 2 Mbps at rate
3 and behave as second class calls from that time on. The system starts from
a uniform initial distribution, which is defined as (a, b) = 1, because the state
probabilities are automatically normalized by the program. The total amount
of data transmitted at 1 Mbps capacity over the (0, t) interval is evaluated by
MRMSolve. The structure of the underlying CTMC is shown in Figure 3, while the
rule-based description of the example is summarized in Table 2. The first 6
moments of the data transmitted at 1 Mbps are shown in Table 3.

Fig. 2. Client server functions of MRMSolve.

Fig. 3. The Markov chain generated based on the model description.

Table 3. Moments of accumulated reward

          E(B(t))   E(B(t)^2)   E(B(t)^3)   E(B(t)^4)   E(B(t)^5)   E(B(t)^6)
t = 1     0.304     0.180       0.144       0.144       0.170       0.232
t = 2     0.427     0.302       0.281       0.319       0.425       0.649
t = 5     0.798     0.852       1.111       1.692       2.935       5.688
t = 10    1.416     2.381       4.581       9.874       23.48       60.92

Future Plans. MRMSolve is under development. The next version of the tool 
is going to include the GUI for the analysis of MRMs with rate and impulse 
rewards and the analysis method provided in pp . 
Abstract. An important goal of the remote terminal emulator-driven
tests described here was obtaining a representative test workload. Reaching
this goal depended on (i) imposing the test workload in a representative
manner, (ii) using representative types of user scenarios in that
workload, and (iii) specifying a representative number of each type. How
the first two subgoals were reached is briefly described. Achieving the
third required definition and calculation of peaking factors. A tool
for calculating these quantities is described.



1 Remote Terminal Emulation 

A benchmark test (BT) may be defined as a means of estimating the performance 
of a system by imposing a test workload on it and measuring its performance. 
Remote terminal emulation is one technique for accomplishing a BT; it uses an 
external driver system (including hardware, operating system, and special driver 
software) called the remote terminal emulator (RTE) to impose a workload on 
the system under test (SUT). If the RTE is correctly configured, the SUT cannot 
distinguish whether the workload is imposed by an actual or emulated population 
of remote users. BT events are recorded in a log analyzed after the test to produce 
various performance metrics. 

An RTE test workload is described using scenarios, sequences of computing 
activities described in a vendor-independent fashion. These generic scenarios are
implemented as system-specific scripts written in a scripting language,
in a high-level programming language, or in a combination of both. Scenarios
may range in complexity from a single step such as, “compile a procedure”, to 
a complex multi-step process such as, “traverse the menus, fill in the electronic 
forms, and initiate the necessary database transactions to create a purchase 
request.” Recent examples of the use of RTEs may be found in [T] and |1]. 

The RTE-driven BTs described here were used to evaluate a variety of proces- 
sor, memory, and peripheral configurations for several large-scale shared-memory 
multiprocessors which supported a single, though complex, database application. 
Users of this application were experiencing frustrating periods during which in- 
teractive response time was poor. The primary goal of these BTs was to represent 
actual system operation as closely as possible, and to use the BT results to
prescribe a remedy. This goal had three components: (i) imposing the workload on
the SUT in a representative manner, (ii) specifying representative types of
scenarios in the imposed workload, and (iii) using representative numbers of those
scenarios in the imposed workload.

This paper focuses on how subgoal (iii) was reached. Further information on 
the conduct of these BTs and on how subgoals (i) and (ii) were achieved may 
be found in [^ . A survey discussing issues relevant to workload characterization 
in general may be found in . 

2 Benchmark Test Workload Definition 

The BT team was able to clearly define the types of scenarios being performed 
on the actual system. However, it was still imperative to determine the number 
of each type to perform during a BT. Based on experience and on constraints 
on SUT availability, the BT team leader decided that an individual BT should 
last two hours. Next, because the poor performance experienced by system users 
occurred during periods of high system utilization, it was decided that each BT 
should represent the workload encountered during the two-hour period with the 
highest utilization on the day with the most database transactions. 

Finally, the number of each type of scenario to perform in a single two-hour 
BT had to be determined. Conveniently, each scenario involved a unique set of 
database transactions, and the database maintained an internal daily log which 
identified and dated each transaction. SQL scripts were written to extract this 
information, with which it was possible to determine the day with the heaviest 
load and the number of times each scenario was performed on that day. However, 
because the database log was not time-stamped, it was impossible to determine 
when during the day the transactions were performed. This was a critical problem 
since there were known to be significant hourly fluctuations in system utilization. 

As recommended in |^, accounting systems are “the easiest way” to gather 
workload data. Jain [3] also notes the usefulness of accounting logs, but states
that the required log analysis programs are often unavailable. This turned out 
to be the case. The operating system accounting logs which maintained time- 
stamped data on each process were examined to determine the scenario counts 
during the peak period of utilization. Unfortunately, due to the nature of the 
database and application software, there was no obvious connection between the 
scenario types in the application and the process names in the accounting logs. 
Instrumentation of the application to obtain the necessary data was infeasible 
due to the unavailability of the source code and because the team would have 
had to wait for the peak utilization period (the end of the fiscal year) to return. 

The solution was to determine a peaking factor (PF), i.e., a number which,
when multiplied by a two-hour average scenario count, gives the peak scenario
count. Obviously, the data required to calculate the scenario PFs would be the
same (unavailable) data required to calculate the peak scenario count. The
accounting logs, however, were time-stamped, thus permitting the calculation of a
PF for CPU utilization. This PF was used to approximate the PFs for the various
scenario types.



Calculation and Use of Peaking Factors 343 



3 Calculating Peaking Factors 

The input to the calculation of the CPU utilization PF was a process accounting 
log, saved each day in /var/adm/pacct. It contains one record for each process 
executed by the system. Each record contains, among other things, the command 
name, process user ID, process start time, (user and system) CPU time used by 
the process, and elapsed time until process completion. 

The data in the pacct file may be used to calculate the values of a CPU
utilization function $u(t)$, where the independent variable $t$ is time of day. In
performing this calculation, it is assumed that the CPU time for a process is
evenly distributed over the (typically greater interval of) elapsed time for that
process. The area under some segment $(t_a, t_b)$ of the graph of $u$ is the CPU time
consumed during that time interval. Dividing the area by the length of the time
interval gives the average CPU time used per unit time. More precisely,

$$\bar{u} = \int_{t_a}^{t_b} u(t)\,dt \,/\, (t_b - t_a). \qquad (1)$$

Here, each pacct file covered a 24-hour period from midnight to midnight (GMT).
If $n_p$ is the number of CPUs in the system, $0 \le \bar{u} \le n_p$, with the upper bound
achievable only at 100 percent utilization of all processors. After defining a
window size $w$ (for these BTs, 2 hours), the maximum average CPU time for this
window size is defined as follows:

$$\bar{u}_w = \max_{t_p} \int_{t_p}^{t_p + w} u(t)\,dt \,/\, w. \qquad (2)$$

Using these two averages, the peaking factor is defined as

$$p_w = \bar{u}_w / \bar{u}. \qquad (3)$$

The graph of $u$ on a day with heavy utilization is shown in Figure 1. Times
shown on the figure are local. $\bar{u} = 5.37$ is denoted on the graph by a dashed line,
while $\bar{u}_w = 11.17$ occurred in the shaded window; $p_w$ for this case was 2.08.

A program pa2pf was implemented to evaluate the two integrals noted above
and calculate the peaking factor. Parameters to this program allow the user to
specify the window size $w$, as well as to restrict the measurement interval $(t_a, t_b)$.
Details of usage are given in the man page accompanying pa2pf.
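
The calculation in equations (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pa2pf program: the record type `ProcessRecord` and the functions `utilization` and `peaking_factor` are hypothetical names, the accounting records are assumed to be already parsed into start time, elapsed time and CPU time, and the integrals are approximated on a grid of equal-width bins, so the maximum in (2) is taken only over windows aligned to bin boundaries.

```python
from dataclasses import dataclass


@dataclass
class ProcessRecord:
    start: float    # process start time, in seconds
    elapsed: float  # wall-clock seconds until completion
    cpu: float      # user + system CPU seconds


def utilization(records, t_a, t_b, step):
    """Sample u(t) on bins of width `step` covering (t_a, t_b).

    Each process's CPU time is spread evenly over its elapsed time.
    """
    n_bins = int(round((t_b - t_a) / step))
    cpu_per_bin = [0.0] * n_bins
    for r in records:
        if r.elapsed <= 0.0:
            continue  # assumption: skip records with no measurable duration
        rate = r.cpu / r.elapsed
        end = r.start + r.elapsed
        first = max(0, int((r.start - t_a) // step))
        last = min(n_bins - 1, int((end - t_a) // step))
        for k in range(first, last + 1):
            lo = max(r.start, t_a + k * step)
            hi = min(end, t_a + (k + 1) * step)
            if hi > lo:
                cpu_per_bin[k] += rate * (hi - lo)
    return [c / step for c in cpu_per_bin]


def peaking_factor(u, step, w):
    """Return (mean of u, maximum windowed mean, peaking factor)."""
    k = int(round(w / step))
    if not 0 < k <= len(u):
        raise ValueError("window must fit inside the measurement interval")
    mean_u = sum(u) / len(u)
    if mean_u == 0.0:
        raise ValueError("no CPU time recorded in the interval")
    window = sum(u[:k])
    best = window
    for i in range(k, len(u)):      # slide the window one bin at a time
        window += u[i] - u[i - k]
        best = max(best, window)
    u_w = best / k
    return mean_u, u_w, u_w / mean_u


# Toy data: one process busy for the whole interval, one short burst.
records = [ProcessRecord(0.0, 10.0, 10.0), ProcessRecord(4.0, 2.0, 4.0)]
u = utilization(records, t_a=0.0, t_b=10.0, step=1.0)
mean_u, u_w, p_w = peaking_factor(u, step=1.0, w=2.0)
```

A smaller `step` gives a closer approximation to the integrals at the cost of more bins; the sliding-window sum keeps the search for the peak window linear in the number of bins.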

4 Using the Peaking Factor 

How is this peaking factor used? Recall that a daily execution count $c_i$ was
available for each scenario type $i$. Then the number of times scenario type $i$
should be performed in a BT lasting time $w$ is given by

$$p_w \, c_i \, w \,/\, (t_b - t_a). \qquad (4)$$
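
As a worked illustration of this scaling, the short Python sketch below applies the peaking factor 2.08 reported above to a two-hour BT and a 24-hour measurement interval. The scenario names and daily counts are hypothetical, as is the function name `bt_scenario_counts`; rounding the results to whole scenario executions is left to the user.

```python
def bt_scenario_counts(daily_counts, p_w, w, t_a, t_b):
    """Scale daily scenario counts to a benchmark test of length w.

    The daily count is first averaged over windows of length w within
    (t_a, t_b) and then multiplied by the peaking factor p_w.
    """
    if t_b <= t_a or w <= 0:
        raise ValueError("need t_b > t_a and w > 0")
    scale = p_w * w / (t_b - t_a)
    return {scenario: scale * count for scenario, count in daily_counts.items()}


# Hypothetical daily counts; times are in hours.
daily = {"create purchase request": 1200, "query status": 3000}
counts = bt_scenario_counts(daily, p_w=2.08, w=2.0, t_a=0.0, t_b=24.0)
```

With these inputs an average two-hour window would contain one twelfth of the daily count, and the peaking factor roughly doubles that figure.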
[Figure 1: plot of u against local time of day; horizontal axis marked Midnight, 6 a.m., Noon, 6 p.m.]

Fig. 1. CPU seconds used per wall clock second on a 24-way multiprocessor



Additionally, because the magnitude of the peaking factor has considerable 
influence on the BT workload, several steps were taken to validate it, including 
calculating PFs (i) based only on users accessing a particular database, (ii) 
using varying window sizes, (iii) for days other than the worst, and (iv) based 
on metrics other than CPU time. These results gave confidence in the use of the 
CPU time PF in these BTs. The authors plan a series of statistical studies to 
thoroughly analyze the significance of the data gathered using this tool. 

References 

1. Maria Calzarossa and Giuseppe Serazzi. TEL - A versatile tool for emulating system
load. In Proceedings of the Fourth International Conference on Modeling Techniques
and Tools for Computer Performance Evaluation, pages 131-148, New York, 1989.
Plenum Press.

2. Maria Calzarossa and Giuseppe Serazzi. Workload characterization: A survey. Proc. 
IEEE, 81(8):1136-1150, August 1993. 

3. Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for 
Experimental Design, Measurement, Simulation, and Modeling. John Wiley & Sons, 
New York, 1991. 

4. Ken J. McDonell. Benchmark frameworks and tools for modelling the workload
profile. Performance Evaluation, 22(1):23-42, 1995.

5. Daniel A. Menasce, Virgilio A. F. Almeida, and Larry W. Dowdy. Capacity Planning 
and Performance Modeling: From Mainframes to Client-Server Systems. Prentice 
Hall, Englewood Cliffs, New Jersey, 1994. 

6. William A. Ward, Jr. Performance testing of CEFMS. Technical Report ITL-99-2, 
U.S. Army Engineer Waterways Experiment Station, Vicksburg, Mississippi, June 
1999. 




Reliability and Performability Modeling 
Using SHARPE 2000 



C. Hirel, R. Sahner, X. Zang, and K. Trivedi 

Center for Advanced Computing and Communication 
Department of Electrical and Computer Engineering 
Duke University, Durham, NC 27708-0291, U.S.A. 

{chirel, kst}@ee.duke.edu

The SHARPE package, Symbolic Hierarchical Automated Reliability and 
Performance Evaluator, is now 13 years old. A well known package in the field of 
reliability and performability, SHARPE is used in universities as well as in com- 
panies. Many important changes have been made during these years to improve 
the satisfaction of our users. Recently several new algorithms have been added 
and a Graphical User Interface has been implemented. This paper presents the 
current status of the tool. 

1 Introduction 

Assessment of performance, reliability and availability is a key step in the
design, analysis and tuning of computer systems. Analytic models provide an easy
and fast way to carry out trade-off studies, answer “what if” questions,
perform sensitivity analyses and compare design alternatives. A system designer
has a wide range of kinds of analytical models to choose from. Each model has 
its strengths and weaknesses in terms of accessibility, ease of construction, effi- 
ciency and accuracy of solution algorithms, and availability of software tools. No 
single kind of model is best, or even necessarily appropriate, for every system 
and every measure of interest. A modeler who is familiar with many different
kinds of models can easily choose models that best suit a particular system and
the kind of measure that is needed at each stage of the design. It is also possible 
to use different kinds of models hierarchically for different physical or abstract 
levels of the system and to use different kinds of models to validate each other’s 
results [3]. We believe that SHARPE is a useful modeler’s “toolchest” because 
it contains support for multiple model types and provides flexible mechanisms 
for combining results so that models can be used in hierarchical combinations. 
SHARPE allows its users to construct and analyze performance, reliability, avail- 
ability and performability models. It gives users direct and complete access to 
the model types without making any assumptions about an application domain. 



2 SHARPE 

SHARPE is a well known modeling tool that is installed at over 280 sites. The
package has been ported to most architectures and operating systems. SHARPE
combines the flexibility of Markov models with the efficiency of combinatorial models,
and is used for software and hardware reliability/performance/performability
modeling. A high level view of the methods can be seen in Figure 1.

Fig. 1. Description of the SHARPE structure



2.1 SHARPE Menu of Model Types 

SHARPE is used for dependability, performance and performability. SHARPE 
models were designed to answer the question: given time-dependent functions 
that describe the behavior of the components of a system and a description of 
the structure of the system, what is the behavior of the system as a whole as 
a function of time? The functions might be cumulative distribution functions 
(CDFs) for component failure times, CDFs for task completion times, or the 
probabilities that components are available at a given time. The system structure 
might be specified, for example, in the form of a fault tree, a task graph or a 
Markov chain. 

The three model types that are commonly used for reliability and availability 
are reliability block diagrams, fault trees, and reliability graphs. 
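
The question posed above, how component functions of time combine into a system function of time, can be illustrated for the simplest combinatorial case. The following Python sketch evaluates a series-parallel reliability block diagram. It is not SHARPE's input language or its algorithms: the helper names are hypothetical, components are assumed to fail independently, and the exponential lifetimes and failure rates are chosen only for illustration.

```python
import math


def exp_reliability(rate):
    """Reliability function R(t) = exp(-rate * t) of one component."""
    return lambda t: math.exp(-rate * t)


def series(*blocks):
    """A series structure works only if every block works."""
    return lambda t: math.prod(r(t) for r in blocks)


def parallel(*blocks):
    """A parallel structure fails only if every block fails."""
    return lambda t: 1.0 - math.prod(1.0 - r(t) for r in blocks)


# Hypothetical system: two redundant processors in series with one bus.
processor = exp_reliability(1.0)
bus = exp_reliability(0.5)
system = series(parallel(processor, processor), bus)

r_at_1 = system(1.0)  # system reliability at time t = 1
```

Because each structure is itself a function of time, blocks can be nested to any depth, which mirrors the way component functions are combined into a system function.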

The two model types that are commonly used for performance are task graphs
and product-form queueing networks. Graph models are commonly used to study
the behavior of programs or processes that contain concurrency and/or
probabilistic branching. The SHARPE graph model assumes no contention for resources.
The queueing network model is useful for examining performance in systems
where limited resources must be shared.

Modeling any system with either a pure performance model or a pure
reliability/availability model can lead to incomplete, or, at least, less precise results.
Gracefully degrading systems may be able to survive the failure of one or more of
their active components and continue to provide service at a reduced level. One
of the most commonly used techniques for the modeling of gracefully degradable
systems is the Markov reward model (MRM). The following model types may also
be used: Markov chains, acyclic or irreducible semi-Markov chains and
generalized stochastic Petri nets. SHARPE supports Generalized Stochastic Petri
Nets (GSPN) as a specification technique for largeness tolerance; GSPN models
are transformed into Markov chains for analysis.

Large models can be avoided by using hierarchical model composition. The 
ability of SHARPE to combine results from different kinds of models makes 
it possible to use state-space methods for those parts of a system that require 
them, and use non-state-space methods for the more “well-behaved” parts of the 
system. 



3 Recent Additions to SHARPE Engine 

Several new additions have been made to the SHARPE engine. One of our main
concerns is the ability to easily construct and analyze larger fault trees
and to reduce the time needed to solve the models supplied by the companies using
SHARPE. The new requirements of our customers pushed us to optimize some
parts of the previous package and to incorporate new features in the new version.
The capability to program new algorithms and easily incorporate them into the
set of underlying engines is one key to the maintenance of our tool.

1. More built-in distribution functions: 

Early in 1998, a new mechanism was put into SHARPE providing additional
built-in distributions such as Erlang, hypoexponential, hyperexponential,
Weibull, defective, mixture and instantaneous component unavailability.

2. Loop capability in CTMC:

Early in 1997, SHARPE was modified to support loops in the input language
specification of Markov models, including Markov reward models. The loop
feature is used to speed up the construction of the Markov chain when a part
of this model has a repetitive structure.

3. New functions and algorithms for fault trees and reliability graphs: 

Three algorithms are implemented in SHARPE for fault tree analysis: the
series-parallel formula (used for fault trees without repeated components [Sj), the VT
algorithm (a multiple inversion (MVI) algorithm to obtain the sum of disjoint
products (SDP) from the mincut set [B]) and the factoring/conditioning
algorithm. With the addition of a BDD-based (Binary Decision Diagram)
algorithm, SHARPE can solve very large fault trees. The efficiency of the BDD
algorithm is a considerable improvement over the original VT algorithm [7].
Reliability/unreliability can be calculated from the BDD, and the mincut set can be
obtained during the analysis. Each event’s contribution to the system reliability
(importance measures) can also be obtained.

4. New GSPN functions: 

GSPN (generalized stochastic Petri nets) models fall into categories that 
correspond to the type of the underlying Markov chain: acyclic, irreducible 
(every state reachable from every other state) and phase-type (containing an 
irreducible part and absorbing states). When the GSPN model type was first
added to SHARPE, steady-state measures were provided for all GSPN types
and transient measures were provided for acyclic and phase-type GSPNs.
For all GSPN types, transient functions were added that are computed for
a specified time t numerically using the same “uniformization” algorithm
[Zj that had been in use for Markov chains for some time. This algorithm
is a more stable computation than the one used to produce an exponential
polynomial result, but must be done separately for each value of t.
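
The remark that uniformization must be repeated for each value of t is easy to see from a sketch of the method. The Python function below is a textbook-style version written for this illustration, not SHARPE's code; the name `uniformization`, the dense list-of-lists matrix and the truncation rule (stop once the accumulated Poisson weight is within `eps` of 1) are assumptions, and the guard on `q * t` merely avoids floating-point underflow of the first Poisson term.

```python
import math


def uniformization(Q, p0, t, eps=1e-10):
    """Transient state probabilities p(t) = p0 * exp(Q t) of a CTMC.

    Q is the generator matrix (rows sum to zero), p0 the initial vector.
    """
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))      # uniformization rate
    if q == 0.0 or t == 0.0:
        return list(p0)
    if q * t > 700.0:
        raise ValueError("q * t too large for this simple sketch")
    # Transition matrix of the uniformized discrete-time chain.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)                # Poisson probability of 0 jumps
    acc = weight
    v = list(p0)
    result = [weight * x for x in v]
    k = 0
    while 1.0 - acc > eps:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k                  # Poisson probability of k jumps
        acc += weight
        for j in range(n):
            result[j] += weight * v[j]
    return result


# Two-state availability model: failure rate 1, repair rate 3, initially up.
Q = [[-1.0, 1.0], [3.0, -3.0]]
p = uniformization(Q, [1.0, 0.0], t=0.5)
```

All terms of the sum are non-negative, which is the source of the numerical stability; the price is that the Poisson weights depend on t, so the sum is rebuilt for every time point.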

4 Graphical User Interface 

The major components of the interface are a model editor, which allows graphical
input of the chosen model types, and an extensive
collection of visualization routines to analyze the output of SHARPE. The
interface provides a high level input format on top of the SHARPE syntax, which
gives users great flexibility. The interface was designed to minimize human
intervention during the system design process. Careful consideration was given
to the design and implementation of the SHARPE interface to facilitate not only the
creation of models but also the use of the hierarchy feature. The “fault tree”
and “reliability block diagram” editors are designed to speed up the creation of
the model. Instead of relying on the “drag and drop” method, the GUI
automatically generates the objects, for example the gates and events of a fault tree
model. The user can define a Markov chain model by using a matrix, and can thus
add new nodes and give the rate of any arc. Instead of asking the designer
to enter the failure rates of the most used components each time, the interface
offers a database which contains the names of these components and their failure
rates. The interface provides a way to plot the results of SHARPE, and it also
allows the creation of Excel spreadsheets containing these data. The interface is
implemented in Java, which is architecture neutral and portable.
Abstract. The Simalytic™ (Simulation/Analytic) Modeling Technique
provides a modeling framework to connect the parts of a computer application
together. This technique uses a general purpose simulation tool as an underlying
framework and the results of analytic platform-centric modeling tools to
represent individual nodes in an enterprise model of the application, leveraging
the modeling tools already being used by many organizations for the individual
nodes. The technique addresses performance analysis and capacity planning of 
complex application designs that incorporate client/server systems, either as 
new functions or as front-ends for legacy systems. The performance analyst 
needs to take an enterprise view of the application and predict its performance 
from the user’s point-of-view. Planning the capacity of client/server 
applications requires a tool for understanding the application performance at 
each of the nodes as well as the inter-relationships between them. The Simalytic 
Modeling Technique provides a tool to understand those relationships. 



1 Introduction 



The primary objective of performance analysis of a computer application, such as 
an order entry or inventory control node, is predicting the impact of configuration, 
load, or design changes on application performance. This type of prediction requires 
the use of some type of performance model. There are many such tools that address 
performance analysis for each of the systems in today’s multi-platform environment 
[1, 2]. The Simalytic™ Modeling Technique provides a bridge across these existing
tools to allow the construction of an enterprise level application model that takes 
advantage of models and tools already in place for analyzing the performance and 
planning the capacity of each system. Simalytic Modeling builds on existing research 
in hybrid modeling using simulation and analytic techniques [3-5] and extends it into 
the area of capacity planning for client/server applications. While not a specific tool 
itself, Simalytic Modeling is a well defined and validated technique to combine the 
results from existing tools to maximize their usefulness. The advantages of using the 
Simalytic Modeling Technique are: Rapid Analysis (quick construction with minimal
effort), Spiral Methodology (model refinement as information becomes available),
Reuse (direct incorporation of existing tools and techniques), Distributed Model
Development (distribution of modeling activities to organizations), and Applicable
Tools (use of the most applicable tool for each node) [6].

What is Simalytic Modeling? Simalytic (from Simulation and Analytic) Modeling
is a hybrid modeling technique that uses a general purpose simulation modeling tool
as an underlying framework and the results of an analytic modeling tool to represent
the individual nodes or systems. The goal is to predict the future performance of an
application executing on heterogeneous computer systems by creating an enterprise
level application model. Detailed descriptions of the technique are presented in [6-9].
The use of the technique with business process models, known as Simalytic Business
Modeling, is presented in [10].



2 Simalytic Modeling Methodology 

The two main attributes of the Simalytic Modeling methodology are the ability to 
combine the results from different modeling techniques and the ability to use results 
from the tools an organization already uses for modeling individual nodes. These 
attributes reduce the time and effort to build an enterprise level model of an 
application by using the results from commercially available platform-centric tools or 
existing detailed application models. 

Simalytic Modeling brings together existing performance models (usually 
platform-centric analytic models) and application information (best expressed as 
simulations). It is a hybrid technique that allows some parts of a model to be replaced 
with submodels using different techniques that provide appropriately similar 
functionality and results. A valid model (proven to produce accurate predictions) for 
each node is used to create a submodel in the application enterprise model. An 
enterprise level model is constructed with a very high level simulation model of the 
application, where each node is a single server, that uses a transform function, the 
Simalytic Function™, to map transaction arrival rates to service times. As the
simulation model is run, the service time dynamically adjusts at each node depending
on the transaction arrival rate for the application and the other work at the node.
Although a Simalytic Model needs a substantial amount of information about the
applications and systems involved, it requires the same level of analysis and 
presentation as individual models once it has been completed and calibrated. The 
phases to create a Simalytic Model are: Workload Analysis, Node Models, Simulation 
Model, Simalytic Model, and Model Analysis. These phases are described in detail, 
along with implementation examples, in [6-8]. 

Workload Analysis Phase: In the Workload Analysis Phase the modeler collects 
information about the application to be modeled. This includes identifying, defining, 
documenting and measuring the application. This phase includes the same type of
workload analysis done for node level modeling efforts, but it must be done for all of
the systems supporting the application from the enterprise point-of-view in
conjunction with both the application developers and the end-users. It is a series of
trade-offs between what the end-users would like to model and what is realistic,
considering how the application and the modeling tools work.

The objectives of this phase are to produce a topology description of the
application that can be easily and accurately translated into a simulation model, to
determine the ability to measure the application at each node from the end-user
point-of-view, and to ensure consistency across the enterprise model.

Node Models Phase: In the Node Models Phase, all of the nodes supporting the 
application are modeled using the same type of modeling done for system level 
modeling efforts. The node level models are coordinated to integrate the additional 
information about the application from the enterprise point-of-view. 

The objectives of this phase are to build a model of each node taking advantage of 
any existing modeling efforts, to develop a solid predictive model for each node with 
a consistent application transaction response time profile for each node, and to reuse 
existing processes to collect measurement data. 

Simulation Model Phase: In the Simulation Model Phase, the modeler builds an 
overall model of the application with each of the nodes supporting it represented as a 
node or server. This phase uses the information from the Workload Analysis Phase to 
connect each of the nodes together to provide the enterprise view of the application. 

The objective of this phase is to build a simulation model of the application that
represents the overall application behavior across the enterprise such that the response 
time at any node can be controlled by the Simalytic Function when it replaces the 
static service time in the next phase. 

Simalytic Model Phase: In the Simalytic Model Phase, the modeler incorporates 
the results of the node models into the overall model of the application. This phase 
uses the information from the prior phases to provide the predictive capabilities to the 
enterprise view of the application. The table of response times and arrival rates from 
the node models is used by the Simalytic Function to extrapolate response times when
presented with arrival rates not specifically modeled.

The objective of this phase is to create an enterprise level model that accurately
reflects the application’s behavior at each node for all expected arrival rates. 
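
The role of the table of arrival rates and response times can be illustrated with a small Python sketch. The text does not specify how the Simalytic Function fills in arrival rates that were not modeled, so the piecewise-linear interpolation used here, with the end segments extended for rates outside the table, is purely an assumption; the function name `make_response_time_function` and the sample table are hypothetical as well.

```python
from bisect import bisect_left


def make_response_time_function(table):
    """Build a lookup from (arrival rate, response time) pairs.

    Rates between table entries are linearly interpolated; rates outside
    the table are extrapolated along the nearest end segment.
    """
    points = sorted(table)
    rates = [rate for rate, _ in points]
    times = [time for _, time in points]
    if len(points) < 2 or len(set(rates)) != len(rates):
        raise ValueError("need at least two entries with distinct arrival rates")

    def response_time(arrival_rate):
        k = bisect_left(rates, arrival_rate)
        k = min(max(k, 1), len(rates) - 1)   # clamp to a valid segment
        x0, x1 = rates[k - 1], rates[k]
        y0, y1 = times[k - 1], times[k]
        return y0 + (y1 - y0) * (arrival_rate - x0) / (x1 - x0)

    return response_time


# Hypothetical node-model results: transactions per second -> seconds.
node_table = [(10.0, 0.20), (20.0, 0.30), (30.0, 0.60)]
f = make_response_time_function(node_table)
mid_value = f(15.0)       # between two modeled rates
beyond_value = f(35.0)    # above the highest modeled rate
```

Linear extrapolation is likely to underestimate response times as a node approaches saturation, so in practice the node models would need to cover the full range of expected arrival rates.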

Model Analysis: The next phase uses the Simalytic Model to analyze the 
application. At this point, the Simalytic Model can be used just as any other type of 
model which has been calibrated. How a model is used to answer “what-if” questions
is very dependent on the questions themselves, and the completion of each phase may 
identify additional information or requirements for the prior phase. Model analysis 
results from one iteration of this modeling spiral will determine the direction and level 
of detail for the next iteration. 



3 Conclusion 

The most significant advantage of the Simalytic Modeling Technique is improved 
productivity for the entire process. Analytic platform-centric tools produce better 
results (because of their in-depth platform knowledge) much faster (because of the 
use of analytic formulas) than simulation models of each node in the application 
environment. Even though the results from these analytic models are approximations 
at average arrival rates and service times, the results are more stable, thus requiring 
fewer model runs to achieve the desired confidence. The Simalytic Modeling 
Technique allows the modeler to produce equal or better results with less effort using 
fewer computational resources. If necessary, a given node can be modeled in greater 
detail using the appropriate modeling tool, either simulation or analytic, and those 
more accurate results can then be incorporated into the Simalytic Model without 
requiring any modifications to the remaining parts of the model. By following these 
steps for implementing a Simalytic Model, the modeler can rapidly produce an 
application model at the level of detail needed to make business decisions. 

Performance analysis is still fundamental to business success. But just as 
application designs are moving away from single system solutions, modeling 
application performance must move away from single system analysis and begin 
predicting the application across the enterprise. 

1 Introduction 

The Stochastic Petri Net Package (SPNP) [2] is a versatile modeling tool for the 
solution of Stochastic Petri Net (SPN) models. The SPN models are described 
in the input language of SPNP, called CSPL (C-based SPN Language), which is 
an extension of the C programming language [8] with additional constructs that 
facilitate easy description of SPN models. Moreover, if the user does not want 
to describe the model in CSPL, a Graphical User Interface (GUI) is available to 
specify all the characteristics of the model as well as the parameters of the 
solution method chosen to solve it. 

Earlier versions of SPNP provided the capability of automatically generating 
and solving Markov reward models starting from Stochastic Reward 
Nets (SRNs) [2], an extension of Petri nets. The new version of SPNP provides 
major extensions in three directions: 

1. Non-Markovian SPNs as well as Fluid Stochastic Petri Nets (FSPNs) can 
be described and solved. 

2. Besides the analytic-numeric solution of Markovian models, discrete-event 
simulation is now available. 

3. A user-friendly GUI is now available. 

A number of important Petri net constructs such as marking dependency 
of firing times, variable cardinality arcs and enabling functions (or guards) [2] 
facilitate the construction of models for complex systems. Also available are 
priorities as well as different resampling policies (preemptive resume, preemptive 
repeat identical and preemptive repeat different) that apply when an enabled 
transition is disabled by the firing of a competing transition and later becomes 
enabled again, or when the firing time of a still enabled transition is affected by 
the firing of another transition. The package also allows logical analysis of the 
Petri net, whereby any general assertions defined on the Petri net are checked for 
each marking of the net. Hooks are available to solve a set of interconnected models 
via fixed-point iteration. 

The distributions for the firing times of transitions currently implemented in 
SPNP are the following: exponential, immediate, constant, uniform, geometric, 
Weibull, truncated normal, lognormal, Erlang, hyperexponential, hypoexponential, 
Pareto, truncated Cauchy, Poisson, binomial, gamma and beta. This list 
is going to be supplemented by the negative binomial, Cox2, triangular, 
loglogistic, and defective exponential with mass at origin. Another extension being 
considered is the use of samples given in a file by the user. 

B.R. Haverkort et al. (Eds.): TOOLS 2000, LNCS 1786, pp. 354- [3^ 2000. 

2 Solution Methods 

The solution methods can be divided into two main categories, the analytic-numeric 
methods and the simulation methods. A high-level view of the methods 
can be seen in Figure 1. 

Fig. 1. Chart for SPNP 



2.1 Analytic-Numeric Methods 

For steady-state evaluation (of a CTMC or a DTMC), the user can choose 
among Steady-State SOR (Successive Overrelaxation) [13], Steady-State 
Gauss-Seidel [13], and the Steady-State Power method. Although SOR is usually the 
fastest method, its convergence is not always guaranteed. In such cases 
Gauss-Seidel may be found to converge. The Power method is guaranteed to converge, 
but its rate of convergence is generally slower than that of SOR. 
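
The trade-off between these iterative schemes can be illustrated with a small sketch. The Python code below is not SPNP's implementation; it is a textbook formulation, written under the assumption that the CTMC is given as a dense generator matrix `Q` (rows summing to zero) and is irreducible. The function names, tolerances and the three-state example are illustrative choices.

```python
def steady_state_sor(Q, omega=1.0, tol=1e-12, max_iter=10000):
    """Solve pi Q = 0 with sum(pi) = 1 by SOR; omega = 1.0 is Gauss-Seidel."""
    n = len(Q)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        old = pi[:]
        for i in range(n):
            # Balance equation of state i: pi_i * (-q_ii) = sum_{j != i} pi_j * q_ji
            inflow = sum(pi[j] * Q[j][i] for j in range(n) if j != i)
            gauss_seidel_value = inflow / (-Q[i][i])
            pi[i] = (1.0 - omega) * pi[i] + omega * gauss_seidel_value
        total = sum(pi)
        pi = [x / total for x in pi]  # renormalize after every sweep
        if max(abs(a - b) for a, b in zip(pi, old)) < tol:
            return pi
    raise RuntimeError("no convergence; try omega = 1.0 or the power method")


def steady_state_power(Q, tol=1e-12, max_iter=100000):
    """Power method on the uniformized DTMC P = I + Q / q."""
    n = len(Q)
    # Rate slightly above the largest exit rate, so P has positive diagonal.
    q = 1.05 * max(-Q[i][i] for i in range(n))
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("power method did not converge")


# Illustrative birth-death chain: arrival rate 1, service rate 2, capacity 2.
Q_example = [[-1.0, 1.0, 0.0],
             [2.0, -3.0, 1.0],
             [0.0, 2.0, -2.0]]
pi_gs = steady_state_sor(Q_example)
pi_pw = steady_state_power(Q_example)
```

Both calls should approach the distribution proportional to (1, 1/2, 1/4); on this small example Gauss-Seidel needs only a handful of sweeps, while the power method takes many more iterations.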

For transient solution of the CTMC, the possible methods are standard 
uniformization and uniformization using the Fox and Glynn method for 
computing the Poisson probabilities. Steady-state detection in transient analysis 
is also possible and is very useful for stiff models. 
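
For readers unfamiliar with uniformization, the following Python sketch shows the standard variant under simplifying assumptions: a dense generator matrix `Q`, Poisson weights computed by the naive recurrence (the Fox and Glynn method exists precisely to avoid the underflow this recurrence suffers for large q*t), and a truncation tolerance `eps` chosen for illustration. It is not taken from SPNP.

```python
import math


def transient_uniformization(Q, pi0, t, eps=1e-12, max_terms=100000):
    """Approximate pi(t) = sum_k Poisson(k; q*t) * pi0 * P^k, P = I + Q/q."""
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))  # uniformization rate
    if q == 0.0 or t == 0.0:
        return list(pi0)
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)   # Poisson probability of k = 0 jumps
    vec = list(pi0)             # pi0 * P^k, starting with k = 0
    result = [weight * x for x in vec]
    accumulated = weight
    k = 0
    # Stop once the neglected Poisson tail is below eps.
    while accumulated < 1.0 - eps and k < max_terms:
        k += 1
        vec = [sum(vec[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k
        accumulated += weight
        result = [r + weight * x for r, x in zip(result, vec)]
    return result


# Two-state example with rates a (0 -> 1) and b (1 -> 0), started in state 0.
a, b, horizon = 1.0, 2.0, 0.5
pi_t = transient_uniformization([[-a, a], [b, -b]], [1.0, 0.0], horizon)
```

For this two-state chain the result can be checked against the closed form p0(t) = b/(a+b) + a/(a+b) * exp(-(a+b)t).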

Parametric sensitivity analysis allows the user to evaluate the effect of changes 
in an input parameter on the output measures. This is useful in system opti- 
mization and bottleneck analysis. 

2.2 Simulation 

Simulation is used to solve the model when analytic-numeric methods fail be- 
cause the state space is too large or because the restrictions on the model are 
too strong. It is used in particular for non-Markovian SRNs and for (Markovian 
or non-Markovian) FSPNs. The available simulation methods are the following: 

- Simulation using independent replications [4] to compute cumulative or average 
instantaneous values up to a fixed simulation time. 

- Simulation using batches [1] to compute steady-state measures. A single run 
is considered and cut into several blocks (assumed to be independent) to 
obtain the confidence interval. 

- Importance splitting techniques (Restart and splitting), which are speed-up 
methods to estimate the probabilities of rare events. The basic idea is 
to split the simulation path when given thresholds are reached, to make the 
rare event occur more often. 

- Importance sampling to speed up the simulation using independent 
replications. The idea here is to modify the distributions (or probabilities) 
of transitions to make them more suitable for the analysis. Of course, the (new) 
system is then biased, but this bias can be corrected using the likelihood 
ratio, which can be computed. 

- Regenerative simulation [5,7] to estimate steady-state measures. In this case 
the expectation of a reward function is estimated by considering different 
regenerative cycles between two returns to one given state and dividing the 
estimated accumulated reward during a cycle by the expectation of the 
length of a cycle. Markov regenerative Petri nets have been introduced in 

n. 
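
The importance sampling idea in the list above can be sketched on a deliberately tiny example. The Python code below estimates the rare-event probability P(X > threshold) for an exponentially distributed firing time X with rate 1 by sampling from a slower (biased) exponential and weighting each hit with the likelihood ratio. The function name, the choice of biased rate and the sample size are assumptions made for illustration; this is not how SPNP selects its change of measure.

```python
import math
import random


def rare_event_probability(threshold, n_samples, biased_rate, seed=1):
    """Estimate P(X > threshold), X ~ Exp(1), sampling from Exp(biased_rate)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        y = rng.expovariate(biased_rate)  # sample under the biased distribution
        if y > threshold:
            # Likelihood ratio f(y) / g(y): original density over biased density.
            original_density = math.exp(-y)
            biased_density = biased_rate * math.exp(-biased_rate * y)
            total += original_density / biased_density
    return total / n_samples


# With rate 1, P(X > 20) = exp(-20), about 2e-9: far too rare for naive sampling.
estimate = rare_event_probability(threshold=20.0, n_samples=100000,
                                  biased_rate=0.05)
exact = math.exp(-20.0)
```

Under the biased rate roughly a third of the samples exceed the threshold, so the weighted average settles close to the exact value with a sample size at which a naive simulation would almost surely observe no event at all.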

We are currently implementing the following simulation methods: 

- Regenerative simulation with importance sampling, which combines 
the advantages of regenerative simulation with those of importance 
sampling. 

- Thinning with independent replications [10], to compute cumulative or average 
instantaneous values up to a fixed simulation time. Thinning is a 
very suitable method to simulate certain stochastic processes, including 
non-homogeneous Poisson processes. 

- Thinning with importance sampling [9]. 

- Thinning with batches. 
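
As an illustration of thinning, the sketch below generates the event times of a non-homogeneous Poisson process on a finite horizon using the classical acceptance/rejection scheme: candidate events are drawn from a homogeneous process with a dominating rate and each is kept with probability rate(t) / rate_max. The function names and the sinusoidal example rate are assumptions for illustration and do not reflect the SPNP code.

```python
import math
import random


def thinning(rate, rate_max, horizon, seed=1):
    """Event times on [0, horizon] of an NHPP with intensity rate(t) <= rate_max."""
    rng = random.Random(seed)
    t = 0.0
    events = []
    while True:
        # Next candidate from the dominating homogeneous Poisson process.
        t += rng.expovariate(rate_max)
        if t > horizon:
            return events
        # Accept the candidate with probability rate(t) / rate_max.
        if rng.random() * rate_max <= rate(t):
            events.append(t)


def example_rate(t):
    """Hypothetical time-varying intensity, bounded above by 10."""
    return 5.0 * (1.0 + math.sin(t))


event_times = thinning(example_rate, rate_max=10.0, horizon=100.0)
```

The expected number of events equals the integral of the intensity over the horizon, about 500 for this example, so the length of `event_times` should be close to that value.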

3 iSPN 6.0 

We have developed an integrated environment for modeling using Stochastic 
Petri Nets, named iSPN. Careful consideration was given to the design and 
implementation of iSPN to facilitate the creation of SPN models. iSPN increases 
the power of SPNP by providing a means of rapidly developing stochastic 
reward nets. Input to SPNP is specified using CSPL (C-based SPN Language), but 
iSPN removes this burden from the user by providing an interface for graphical 
representation of the model. 

The major components of the iSPN interface are a Petri net editor, which 
allows graphical input of the stochastic Petri nets, and an extensive collection of 
visualization routines to analyze the output results of SPNP and to aid debugging. 
Each module corresponds to a page in the software. iSPN provides a high-level 
input format to CSPL, which provides great flexibility to users. 

The previous version was developed using the scripting language Tcl (Tool Command 
Language), developed by Prof. John Ousterhout of U.C. Berkeley, and its extension 
Tk, a toolkit for X Windows. The modifications made to the current version 
of SPNP and the development of the SHARPE GUI in Java at Duke by the 
same team were good reasons to consider a new evolution of iSPN. The current 
version is also written in Java and is integrated in a common GUI with the SHARPE 
GUI. Thus the output of an SPNP model can be used as an input to a SHARPE 
model. The hierarchy feature of SHARPE is reinforced by this common design. 

Abstract. Although several tools have been developed for the estimation 
of software reliability, they are highly specialized in the approaches 
they implement and the particular phase of the software lifecycle in which 
they are applicable. Also, the conventional techniques for software reliability 
evaluation, which treat the software as a monolithic entity, are 
inadequate to assess the reliability of heterogeneous systems. We present 
here our tool, called Software Reliability Estimation and Prediction Tool 
(SREPT), which offers a unified framework containing techniques (including 
the architecture-based approach) to assist in the evaluation of software 
reliability at several phases of the software lifecycle. 



1 Introduction and Motivation 

Various techniques have been proposed in the literature to evaluate the depend- 
ability of a software product. The tools available today are highly specialized 
in the approaches they implement and the phase of the lifecycle in which they 
are applicable. For example, tools such as McCabe & Associates’ Visual Quality 
Toolset (VQT) [1^ predict software quality based on software complexity met- 
rics, whereas tools such as SMERFS, AT&T SRE Toolkit, SoRel, and CASRE 
use failure data collected during the testing phase to obtain reliability estimates 
of interest. Tools like Emerald |5] and ROBUST p3], pZ] offer the software met- 
rics approach as well as reliability growth model based approaches to analyse 
failure data, but do not aid in release time estimates and do not enable analyses 
based on software architecture. Architecture-based techniques, which are gaining 
widespread attention due to the deployment of component-based systems, 
are not available in the form of specialized tools. We now present the high-level 
architecture of the Software Reliability Estimation and Prediction Tool (SREPT), 
which offers a unified framework for software reliability estimation and prediction 
comprising several techniques, including the architecture-based approaches. 

2 Architecture of SREPT 

In this section we briefly describe the engines of SREPT in terms of the input 
data they accept for processing and the output they provide. The interested 
reader is referred to [10] for further study. The SREPT GUI has been implemented 
entirely in Java. The solution engines have been implemented either in 
Java or C depending on how compute-intensive they are. Figure 1 shows the 
high-level architecture of SREPT. 

Fig. 1. Architecture of SREPT 

In the post-development, pre-testing phase, 
SREPT accepts software product/process metrics as input, and produces an es- 
timate of the number of faults in each module using either the fault density 
approach [S] or regression tree modeling technique [g, 0. 

During the testing phase, SREPT offers the user the option of an analysis 
based on the failure data collected during testing, using the enhanced 
non-homogeneous Poisson process (ENHPP) model to predict the failure intensity, 
number of faults remaining, coverage and reliability [5]. When using the failure 
data to obtain reliability predictions, the ENHPP model in SREPT currently 
uses four coverage functions, namely, exponential, Weibull, S-shaped, and 
log-logistic. The exponential, Weibull, and S-shaped coverage functions correspond 
to the Goel-Okumoto, Generalized Goel-Okumoto, and S-shaped software reliability 
growth models respectively, which belong to the class of finite failure 
non-homogeneous Poisson process (NHPP) models. The log-logistic coverage function 
was proposed to capture the increasing/decreasing nature of the failure occurrence 
rate per fault, which was observed during the analysis of some data sets 
and which the former three models were inadequate to capture. SREPT 
determines the “best” among these models subject to goodness-of-fit, bias and 
bias trend criteria. The ability to report confidence interval bounds 
for the predictions is also planned to be incorporated into SREPT. The ENHPP 
model can also be driven by test coverage measurements obtained during testing 
and the estimate of the number of faults based on software product/process 
metrics. SREPT thus offers a mechanism to combine software metrics, failure data 
and coverage based approaches to reliability prediction via the ENHPP model. 
Various optimization engines to compute release times of the software subject to 
various constraints such as maximizing reliability, minimizing number of faults 
detected in the field, etc. will also be a part of SREPT. 
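
To make the role of the coverage function concrete, the sketch below uses the usual ENHPP relation in which the expected number of faults detected by time t is m(t) = a * c(t), where a is the expected total number of faults and c(t) is the coverage function. The four coverage functions follow their standard textbook forms; all parameter names and values are illustrative assumptions, and none of this is SREPT code.

```python
import math


def exponential_coverage(t, g):
    return 1.0 - math.exp(-g * t)                    # Goel-Okumoto


def weibull_coverage(t, g, gamma):
    return 1.0 - math.exp(-g * t ** gamma)           # Generalized Goel-Okumoto


def s_shaped_coverage(t, g):
    return 1.0 - (1.0 + g * t) * math.exp(-g * t)    # S-shaped


def log_logistic_coverage(t, lam, kappa):
    x = (lam * t) ** kappa
    return x / (1.0 + x)


def expected_failures(a, coverage, t):
    """Mean value function m(t) = a * c(t)."""
    return a * coverage(t)


def faults_remaining(a, coverage, t):
    return a * (1.0 - coverage(t))


def failure_intensity(a, coverage, t, h=1e-6):
    """lambda(t) = a * c'(t), via a central difference (valid for t > h)."""
    return a * (coverage(t + h) - coverage(t - h)) / (2.0 * h)


def conditional_reliability(a, coverage, t, x):
    """Probability of no failure in (t, t + x] for an NHPP."""
    return math.exp(-a * (coverage(t + x) - coverage(t)))


# Hypothetical parameters: 100 expected faults, exponential coverage rate 0.1.
def example_coverage(t):
    return exponential_coverage(t, 0.1)


m_10 = expected_failures(100.0, example_coverage, 10.0)
lambda_10 = failure_intensity(100.0, example_coverage, 10.0)
r_10_1 = conditional_reliability(100.0, example_coverage, 10.0, 1.0)
```

Swapping `example_coverage` for one of the other coverage functions changes the shape of the failure intensity without touching the remaining formulas, which is the sense in which a single model can host several reliability growth models.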

Conventional software reliability models assume instantaneous and perfect 
debugging. SREPT allows the user to analyze the effect of non-instantaneous and 
imperfect debugging on the residual number of faults in the software. Various 
metrics of interest such as the failure intensity and reliability can then be 
recomputed to reflect the time and resources expended in debugging, to obtain 
realistic estimates. 

SREPT is planned to accept the architecture of the application modeled either 
as a discrete time Markov chain (DTMC), a continuous time Markov chain 
(CTMC), a directed acyclic graph (DAG), a stochastic Petri net (SPN), a product 
form queueing network (PFQN), or a non-product form queueing network 
(NPFQN) [19], and the failure behavior of the individual components specified 
either as a probability of failure (or reliability), a constant failure rate or a 
time-dependent failure rate. The architecture of an application will be combined 
with the failure behavior of its components to provide architecture-based software 
reliability and performance predictions [17], [18]. 
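
One well-known way of combining a DTMC architecture with component reliabilities is a Cheung-style model, sketched below. It is offered as an example of the general idea and is an assumption on our part, not a description of SREPT's engine: control moves between components according to a transition matrix `P`, component i executes correctly with probability `R[i]`, and the application succeeds if the exit component is reached and executes correctly. The function name and the example matrices are hypothetical.

```python
def system_reliability(P, R, start, exit_state, tol=1e-12, max_iter=100000):
    """Probability of reaching and correctly executing the exit component.

    s[i] is the success probability when control is about to enter
    component i; it satisfies s[i] = R[i] * sum_j P[i][j] * s[j], with
    s[exit_state] = R[exit_state]. Solved by fixed-point iteration.
    """
    n = len(P)
    s = [0.0] * n
    for _ in range(max_iter):
        new = []
        for i in range(n):
            if i == exit_state:
                new.append(R[i])
            else:
                new.append(R[i] * sum(P[i][j] * s[j] for j in range(n)))
        if max(abs(a - b) for a, b in zip(new, s)) < tol:
            return new[start]
        s = new
    raise RuntimeError("fixed-point iteration did not converge")


# Two components executed in sequence.
series = system_reliability([[0.0, 1.0], [0.0, 0.0]], [0.9, 0.8], 0, 1)

# Component 1 calls component 2, which loops back with probability 0.5
# or hands over to a perfectly reliable exit component.
looped = system_reliability(
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 0.0]],
    [0.9, 0.9, 1.0], 0, 2)
```

The series case reduces to the product of the component reliabilities, while the loop lowers the result because each additional pass through components 1 and 2 is another opportunity to fail.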

To analyze the effect of debugging policies and to support the architecture-based 
approaches, SREPT also allows efficient discrete event simulation using the 
thinning technique as an alternative solution method. 

Abstract. Predicting the performance of a parallel relational DBMS 
executing an arbitrary set of transactions on particular data sets for dif- 
ferent architectural configurations with different data placement strate- 
gies is a non-trivial task. An analytical tool has been developed to assist 
with this task and can be used for application sizing, capacity planning 
and performance tuning. 

1 STEADY 

STEADY [7] (System Throughput Estimator for Advanced Database sYstems) 
is a tool designed to assist both vendor and end-user in predicting the 
performance of parallel relational DBMSs. It provides estimates of performance 
for particular workloads (tables and transactions) on specific machine 
configurations (architecture and DBMS). It enables the user to experiment with 
different data placements [5] and to see how the performance will vary with future 
changes to the workload. 

The STEADY tool consists of eight modules grouped into four layers and 
connected by a graphical user interface. The first layer is the Application layer; 
it is concerned with the data and how it is distributed within the system. This 
includes the properties of the relations (covered by the Profiler), the way in which 
each relation is fragmented and how the fragments are assigned to individual 
discs on processing elements (DPTool). 

The second layer is the DBMS layer; it models the processes specific to the 
database system itself. Firstly, it determines how queries are fragmented into 
basic operations and what parallelisation strategies are employed (Query 
Paralleliser), as different DBMSs use quite different approaches, which have 
significant effects on performance. Secondly, this layer provides a characterisation of 
the behaviour of the cache and an estimate of the hit ratio [B] for pages from 
different relations (Cache Model Component). These results are used to produce 
a profile of the atomic operations required on each processing element to handle 
a particular set of queries (Modeller). 

The third layer is the Machine layer. Using the information on atomic operations 
performed on each processing element, this layer determines resource usage 
profiles and hence the system bottlenecks and estimated maximum throughput. 
The final layer is the Response Time layer. It uses the accumulated results to 
determine the lengths of the queues at the various resources involved and hence 
the overall response time for individual transactions in the mix at particular 
transaction arrival rates. 
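
The step from resource usage profiles to bottleneck, maximum throughput and response time can be illustrated with the simplest open queueing approximation. The sketch below is an assumption for illustration, not STEADY's actual algorithm: each resource is treated as an independent single-server queue with service demand D per transaction, so the bottleneck is the resource with the largest demand, the maximum throughput is 1 / max(D), and the response time at arrival rate x is the sum of D / (1 - x * D). The resource names and demand values are hypothetical.

```python
def bottleneck(demands):
    """Resource with the largest service demand per transaction."""
    return max(demands, key=demands.get)


def max_throughput(demands):
    """Upper bound on throughput: the bottleneck saturates at 1 / D_max."""
    return 1.0 / max(demands.values())


def response_time(demands, arrival_rate):
    """Sum of per-resource residence times D / (1 - utilisation)."""
    total = 0.0
    for resource, demand in demands.items():
        utilisation = arrival_rate * demand
        if utilisation >= 1.0:
            raise ValueError(f"{resource} is saturated at this arrival rate")
        total += demand / (1.0 - utilisation)
    return total


# Hypothetical service demands in seconds per transaction.
profile = {"pu": 0.010, "disc0": 0.020, "net": 0.005}
limit = max_throughput(profile)           # transactions per second
latency = response_time(profile, 25.0)    # seconds at 25 transactions/s
```

As the arrival rate approaches `limit`, the term for the bottleneck resource dominates the sum, which is why the response time curve rises sharply near the estimated maximum throughput.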

To illustrate this, consider the following SQL query; STEADY’s execution 
plan for this query is shown to the right of it. 

    SELECT max(amount)      Result = agg(select(scan, 
    FROM table1                                 table1, 
    WHERE key > 0                               attr(table1, 1) > 0), 
                                         max(attr(table1, 1))) 



This execution plan is converted into an operator tree which reflects the par- 
allelisation and resource scheduling of the target DBMS. Processing elements 
(PEs) are then assigned to execute these operators to produce a parallel exe- 
cution schedule - the output of the Query Paralleliser. The execution schedule 
generated for the query above is given in Fig. 1(a). 




(b) Task block: 

    BLOCK: blk1 
    MODE: full_depend start 
    HOME: pe0 1.0; pe1 1.0; pe2 1.0; pe3 1.0 
    OPERATION_DEFINITION 
    loop {pe0:1710.6;pe1:1701.2;pe2:1702.0;pe3:1703.7} { 
      mean_shared_lock_waiting_time; 
      read {pe0:d0(1.0),pe1:d0(1.0),pe2:d0(1.0),pe3:d0(1.0)} 
        pe0:0.0&pe1:0.0&pe2:0.0&pe3:0.0; 
      loop {pe0:19.8;pe1:19.8;pe2:19.8;pe3:19.8} { 
        predicate_check; 
        group { 
          sendbM 155.0 1.0 
            pe3-(pe0 0.25; pe1 0.25; pe2 0.25; pe3 0.25); 
            pe2-(pe0 0.25; pe1 0.25; pe2 0.25; pe3 0.25); 
            pe1-(pe0 0.25; pe1 0.25; pe2 0.25; pe3 0.25); 
            pe0-(pe0 0.25; pe1 0.25; pe2 0.25; pe3 0.25) 
        } 1.0 
      } 
    } 
    END_DEFINITION 

(c) Resource block: 

    BLOCK: blk1 
    MODE: full_depend start 
    HOME: pe0 1.0; pe1 1.0; pe2 1.0; pe3 1.0 
    RESOURCE_TIME 
    group { 
      ssu X; 
      pu Y 
    } pe0:0.0;pe1:1.0;pe2:1.0;pe3:1.0; 
    loop {pe0:1710.6;pe1:1701.2;pe2:1702.0;pe3:1703.7} { 
      mean_shared_lock_waiting_time; 
      group { 
        pu {pe0(2551),pe1(2551),pe2(2551),pe3(2551)}; 
        ssu {pe0(40), pe1(40), pe2(40), pe3(40)}; 
        option { 
          (disc0 {pe0(16), pe1(16), pe2(16), pe3(16)} : 
           pe0:1.0&pe1:1.0&pe2:1.0&pe3:1.0) 
        } 
      } pe0:0.0&pe1:0.0&pe2:0.0&pe3:0.0; 
      loop {pe0:19.8;pe1:19.8;pe2:19.8;pe3:19.8} { 
        pu 40; 
        group { 
          group { 
            pu A; 
            ssu B; 
            net C 
          } pe0:1.75&pe1:1.75&pe2:1.75&pe3:1.75 

    END_TIME 

Fig. 1. Example query: (a) execution schedule, (b) task block, (c) resource block 

The execution schedule is a graph structure with nodes representing relational 
operators and links between nodes representing dependencies between operators 
as well as various forms of parallelism between two or more operators. 
From this the task block representation (Fig. 1(b)) is generated (Modeller), and 
with the allocation of operators to physical resources a resource usage block 
representation (Fig. 1(c)) is produced (Machine layer). This is used to estimate 
maximum transaction throughput and response time (see Tomov et al. [4]). 

2 Comparison with Process Algebra 

To validate STEADY, its predictions were compared against measured values 
obtained from a large number of benchmark runs. However, this proved a lengthy 
process. As an alternative check on the model, and a comparison of different 
approaches, process algebra was used. 

Process algebras are mathematical theories which model communicating and 
concurrent systems. The one used to validate the analytical model was the 
Performance Evaluation Process Algebra (PEPA), a stochastic process algebra 
developed to investigate the impact of the compositional features of process 
algebras on performance modelling. 
In PEPA, a system is expressed as an interaction of components which engage 
in activities. These components correspond to the parts of the system or events 
in the behaviour of the system. Each component or event has a behaviour which 
is defined by the activities in which it engages. 

Initially a simple database system was considered. A PE was taken to consist 
of a database instance which has a transaction manager (TM), a concurrency 
control unit (CCU), a lock manager (LM) and a buffer manager (BM). The TM 
receives requests and on receipt, sends the request to the CCU and then waits 
for the results from the CCU. On receiving the request the CCU asks the LM 
for the required locks and awaits confirmation before proceeding. Once the LM 
grants the locks the CCU starts retrieving data via the BM and waits until the 
BM indicates that the request has been completed. Then the CCU requests the 
LM to release the relevant locks and passes the results to the TM. The BM has 
a limited cache for storing data pages, which introduces the possibility of the 
requested data already being in the cache. The LM allows for the probability 
that another request is using the desired locks, in which case the request has to 
wait in a queue before the locks can be granted. 

This simple single PE system expands to a very large state space of over 
144,000 states, which makes it impractical to solve, even using equivalence to 
reduce the model as far as possible. To handle this, a decompositional approach 
was developed in which the model is split into submodels, each consisting of one 
or more atomic components, with queues introduced between submodels. 
Details of the techniques used to split the model into submodels and obtain 
solutions for these are given in . 

Using the decompositional approach, not only could solutions be obtained for 
the simple single PE system but also for multiple PEs involving more complex 
models. This enabled the modelling of different architectures and different data 
placements for a given architecture. The most complex model which we have 
constructed is for the Informix XPS DBMS performing simple queries on the 
ICL Goldrush Megaserver with up to twelve PEs. 

To check the validity of STEADY against process algebra the TPC-B benchmark 
was used. Fig. 2 shows a summary of the results obtained from the TPC-B 
experiment for both STEADY and PEPA for configurations of 8 to 12 PEs, 
each with one disc attached. As can be seen, the two sets of results are in good 
agreement. 

Fig. 2. Performance of PEPA against STEADY 

3 Conclusions 

STEADY is a performance prediction tool for parallel relational databases that 
predicts the throughput and response time for both complex queries and mixed 
workloads. It can assist users to select a data placement, size a required system 
or tune an existing system. 

This paper describes the design of the tool and compares the performance 
predictions against those obtained using process algebra to confirm the validity 
of the analytical approach used. 